diff mbox series

[net-next,1/3] net: provide macros for commonly copied lockless queue stop/wake code

Message ID 20230322233028.269410-1-kuba@kernel.org (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [net-next,1/3] net: provide macros for commonly copied lockless queue stop/wake code | expand

Checks

Context Check Description
netdev/series_format warning Series does not have a cover letter
netdev/tree_selection success Clearly marked for net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 18 this patch: 18
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 18 this patch: 18
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 18 this patch: 18
netdev/checkpatch warning CHECK: Macro argument 'get_desc' may be better as '(get_desc)' to avoid precedence issues CHECK: Macro argument 'start_thrs' may be better as '(start_thrs)' to avoid precedence issues CHECK: Macro argument 'stop_thrs' may be better as '(stop_thrs)' to avoid precedence issues WARNING: Avoid unnecessary line continuations WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: macros should not use a trailing semicolon WARNING: memory barrier without comment
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jakub Kicinski March 22, 2023, 11:30 p.m. UTC
A lot of drivers follow the same scheme to stop / start queues
without introducing locks between xmit and NAPI tx completions.
I'm guessing they all copy'n'paste each other's code.

Smaller drivers shy away from the scheme and introduce a lock
which may cause deadlocks in netpoll.

Provide macros which encapsulate the necessary logic.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
rfc: https://lore.kernel.org/all/20230311050130.115138-1-kuba@kernel.org/
 - perdicate -> predicate
 - on race use start instead of wake and make a note of that
   in the doc / comment at the start
---
 include/net/netdev_queues.h | 171 ++++++++++++++++++++++++++++++++++++
 1 file changed, 171 insertions(+)
 create mode 100644 include/net/netdev_queues.h

Comments

Andrew Lunn March 23, 2023, 12:35 a.m. UTC | #1
On Wed, Mar 22, 2023 at 04:30:26PM -0700, Jakub Kicinski wrote:
> A lot of drivers follow the same scheme to stop / start queues
> without introducing locks between xmit and NAPI tx completions.
> I'm guessing they all copy'n'paste each other's code.
>
> Smaller drivers shy away from the scheme and introduce a lock
> which may cause deadlocks in netpoll.

Hi Jakub

I notice there is no patch 0/X. Seems like the above would be good
material for it, along with a comment that a few drivers are converted
to make use of the new macros.

   Andrew
Jakub Kicinski March 23, 2023, 1:04 a.m. UTC | #2
CC: maintainers, in case there isn't a repost
https://lore.kernel.org/all/20230322233028.269410-1-kuba@kernel.org/

On Thu, 23 Mar 2023 01:35:34 +0100 Andrew Lunn wrote:
> On Wed, Mar 22, 2023 at 04:30:26PM -0700, Jakub Kicinski wrote:
> > A lot of drivers follow the same scheme to stop / start queues
> > without introducing locks between xmit and NAPI tx completions.
> > I'm guessing they all copy'n'paste each other's code.
> >
> > Smaller drivers shy away from the scheme and introduce a lock
> > which may cause deadlocks in netpoll.  
> 
> I notice there is no patch 0/X. Seems like the above would be good
> material for it, along with a comment that a few drivers are converted
> to make use of the new macros.

Then do I repeat the same text in the commit? Or cut the commit down?
Doesn't that just take away information from the commit which will
show up in git blame?

Having a cover letter is a good default, and required if the series 
is a larger change decomposed into steps. But here there is a major
change and a bunch of loose conversions. More sample users than
meaningful part.

LMK what your preference for splitting this info is, I'm unsure.
Yunsheng Lin March 23, 2023, 3:05 a.m. UTC | #3
On 2023/3/23 7:30, Jakub Kicinski wrote:
> A lot of drivers follow the same scheme to stop / start queues
> without introducing locks between xmit and NAPI tx completions.
> I'm guessing they all copy'n'paste each other's code.
> 
> Smaller drivers shy away from the scheme and introduce a lock
> which may cause deadlocks in netpoll.
> 
> Provide macros which encapsulate the necessary logic.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> rfc: https://lore.kernel.org/all/20230311050130.115138-1-kuba@kernel.org/
>  - perdicate -> predicate
>  - on race use start instead of wake and make a note of that
>    in the doc / comment at the start
> ---
>  include/net/netdev_queues.h | 171 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 171 insertions(+)
>  create mode 100644 include/net/netdev_queues.h
> 
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> new file mode 100644
> index 000000000000..64e059647274
> --- /dev/null
> +++ b/include/net/netdev_queues.h
> @@ -0,0 +1,171 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_NET_QUEUES_H
> +#define _LINUX_NET_QUEUES_H
> +
> +#include <linux/netdevice.h>
> +
> +/* Lockless queue stopping / waking helpers.
> + *
> + * These macros are designed to safely implement stopping and waking
> + * netdev queues without full lock protection. We assume that there can
> + * be no concurrent stop attempts and no concurrent wake attempts.
> + * The try-stop should happen from the xmit handler*, while wake up

unnecessary '*' after handler?

> + * should be triggered from NAPI poll context. The two may run
> + * concurrently (single producer, single consumer).
> + *
> + * All descriptor ring indexes (and other relevant shared state) must
> + * be updated before invoking the macros.
> + *
> + * * the try-stop side does not reschedule Tx (netif_tx_start_queue()

double '*' here?

> + *   instead of netif_tx_wake_queue()) so uses outside of the xmit
> + *   handler may lead to bugs
> + */
> +
> +#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)		\
> +	({								\
> +		int _res;						\
> +									\
> +		netif_tx_stop_queue(txq);				\
> +									\
> +		smp_mb();						\
> +									\
> +		/* We need to check again in a case another		\
> +		 * CPU has just made room available.			\
> +		 */							\
> +		if (likely(get_desc < start_thrs)) {			\
> +			_res = 0;					\
> +		} else {						\
> +			netif_tx_start_queue(txq);			\
> +			_res = -1;					\
> +		}							\
> +		_res;							\
> +	})								\
> +
> +/**
> + * netif_tx_queue_maybe_stop() - locklessly stop a Tx queue, if needed
> + * @txq:	struct netdev_queue to stop/start
> + * @get_desc:	get current number of free descriptors (see requirements below!)
> + * @stop_thrs:	minimal number of available descriptors for queue to be left
> + *		enabled
> + * @start_thrs:	minimal number of descriptors to re-enable the queue, can be
> + *		equal to @stop_thrs or higher to avoid frequent waking
> + *
> + * All arguments may be evaluated multiple times, beware of side effects.
> + * @get_desc must be a formula or a function call, it must always
> + * return up-to-date information when evaluated!
> + * Expected to be used from ndo_start_xmit, see the comment on top of the file.

For now,there seems to be three ways of calling *_maybe_stop():
1. called before transimting a skb.
2. called after transimting a skb.
3. called both before and after transimting a skb.

Which one should new driver follow?
Do we need to make it clear here?
Jakub Kicinski March 23, 2023, 3:27 a.m. UTC | #4
On Thu, 23 Mar 2023 11:05:36 +0800 Yunsheng Lin wrote:
> > +/* Lockless queue stopping / waking helpers.
> > + *
> > + * These macros are designed to safely implement stopping and waking
> > + * netdev queues without full lock protection. We assume that there can
> > + * be no concurrent stop attempts and no concurrent wake attempts.
> > + * The try-stop should happen from the xmit handler*, while wake up  
> 
> unnecessary '*' after handler?
> 
> > + * should be triggered from NAPI poll context. The two may run
> > + * concurrently (single producer, single consumer).
> > + *
> > + * All descriptor ring indexes (and other relevant shared state) must
> > + * be updated before invoking the macros.
> > + *
> > + * * the try-stop side does not reschedule Tx (netif_tx_start_queue()  
> 
> double '*' here?

It's a footnote..

> > + *   instead of netif_tx_wake_queue()) so uses outside of the xmit
> > + *   handler may lead to bugs
> > + */
> > +
> > +#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)		\
> > +	({								\
> > +		int _res;						\
> > +									\
> > +		netif_tx_stop_queue(txq);				\
> > +									\
> > +		smp_mb();						\
> > +									\
> > +		/* We need to check again in a case another		\
> > +		 * CPU has just made room available.			\
> > +		 */							\
> > +		if (likely(get_desc < start_thrs)) {			\
> > +			_res = 0;					\
> > +		} else {						\
> > +			netif_tx_start_queue(txq);			\
> > +			_res = -1;					\
> > +		}							\
> > +		_res;							\
> > +	})								\
> > +
> > +/**
> > + * netif_tx_queue_maybe_stop() - locklessly stop a Tx queue, if needed
> > + * @txq:	struct netdev_queue to stop/start
> > + * @get_desc:	get current number of free descriptors (see requirements below!)
> > + * @stop_thrs:	minimal number of available descriptors for queue to be left
> > + *		enabled
> > + * @start_thrs:	minimal number of descriptors to re-enable the queue, can be
> > + *		equal to @stop_thrs or higher to avoid frequent waking
> > + *
> > + * All arguments may be evaluated multiple times, beware of side effects.
> > + * @get_desc must be a formula or a function call, it must always
> > + * return up-to-date information when evaluated!
> > + * Expected to be used from ndo_start_xmit, see the comment on top of the file.  
> 
> For now,there seems to be three ways of calling *_maybe_stop():
> 1. called before transimting a skb.
> 2. called after transimting a skb.
> 3. called both before and after transimting a skb.
> 
> Which one should new driver follow?
> Do we need to make it clear here?

After, the check before is more of a sanity check.
AFAIR netdevice.rst or some other place documents the stopping
expectations.
Pavan Chebbi March 23, 2023, 4:53 a.m. UTC | #5
On Thu, Mar 23, 2023 at 5:01 AM Jakub Kicinski <kuba@kernel.org> wrote:
> +
> +#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)             \
> +       ({                                                              \
> +               int _res;                                               \
> +                                                                       \
> +               netif_tx_stop_queue(txq);                               \
> +                                                                       \
> +               smp_mb();                                               \
> +                                                                       \
> +               /* We need to check again in a case another             \
> +                * CPU has just made room available.                    \
> +                */                                                     \
> +               if (likely(get_desc < start_thrs)) {                    \

I am only curious to understand why initializing _res with likely
result and having a condition to cover only the unlikely case, would
not be better.
As in:
    int _res = 0;
    if (unlikely(get_desc >= start_thrs) {
        start_queue()
        _res = -1
    }

> +                       _res = 0;                                       \
> +               } else {                                                \
> +                       netif_tx_start_queue(txq);                      \
> +                       _res = -1;                                      \
> +               }                                                       \
> +               _res;                                                   \
> +       })                                                              \
> +
> --
> 2.39.2
>
Jakub Kicinski March 23, 2023, 5:08 a.m. UTC | #6
On Thu, 23 Mar 2023 10:23:32 +0530 Pavan Chebbi wrote:
> > +               /* We need to check again in a case another             \
> > +                * CPU has just made room available.                    \
> > +                */                                                     \
> > +               if (likely(get_desc < start_thrs)) {                    \  
> 
> I am only curious to understand why initializing _res with likely
> result and having a condition to cover only the unlikely case, would
> not be better.
> As in:
>     int _res = 0;
>     if (unlikely(get_desc >= start_thrs) {
>         start_queue()
>         _res = -1
>     }

I don't think it matters.
Alexander H Duyck March 23, 2023, 4:05 p.m. UTC | #7
On Wed, 2023-03-22 at 16:30 -0700, Jakub Kicinski wrote:
> A lot of drivers follow the same scheme to stop / start queues
> without introducing locks between xmit and NAPI tx completions.
> I'm guessing they all copy'n'paste each other's code.
> 
> Smaller drivers shy away from the scheme and introduce a lock
> which may cause deadlocks in netpoll.
> 
> Provide macros which encapsulate the necessary logic.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> rfc: https://lore.kernel.org/all/20230311050130.115138-1-kuba@kernel.org/
>  - perdicate -> predicate
>  - on race use start instead of wake and make a note of that
>    in the doc / comment at the start
> ---
>  include/net/netdev_queues.h | 171 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 171 insertions(+)
>  create mode 100644 include/net/netdev_queues.h
> 
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> new file mode 100644
> index 000000000000..64e059647274
> --- /dev/null
> +++ b/include/net/netdev_queues.h
> @@ -0,0 +1,171 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_NET_QUEUES_H
> +#define _LINUX_NET_QUEUES_H
> +
> +#include <linux/netdevice.h>
> +
> +/* Lockless queue stopping / waking helpers.
> + *
> + * These macros are designed to safely implement stopping and waking
> + * netdev queues without full lock protection. We assume that there can
> + * be no concurrent stop attempts and no concurrent wake attempts.
> + * The try-stop should happen from the xmit handler*, while wake up
> + * should be triggered from NAPI poll context. The two may run
> + * concurrently (single producer, single consumer).
> + *
> + * All descriptor ring indexes (and other relevant shared state) must
> + * be updated before invoking the macros.
> + *
> + * * the try-stop side does not reschedule Tx (netif_tx_start_queue()
> + *   instead of netif_tx_wake_queue()) so uses outside of the xmit
> + *   handler may lead to bugs
> + */
> +
> +#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)		\
> +	({								\
> +		int _res;						\
> +									\
> +		netif_tx_stop_queue(txq);				\
> +									\
> +		smp_mb();						\
> +									\
> +		/* We need to check again in a case another		\
> +		 * CPU has just made room available.			\
> +		 */							\
> +		if (likely(get_desc < start_thrs)) {			\
> +			_res = 0;					\
> +		} else {						\
> +			netif_tx_start_queue(txq);			\
> +			_res = -1;					\
> +		}							\
> +		_res;							\
> +	})								\
> +

The issue I see here is that with this being a macro it abstracts away
the relationship between get_desc and the memory barrier. At a minimum
I think we should be treating get_desc as a function instead of just
passing it and its arguments as a single value. Maybe something more
like how read_poll_timeout passes the "op" and then uses it as a
function with args passed seperately. What we want to avoid is passing
a precomuted value to this function as get_desc.

In addition I think I would prefer to see _res initialized to the
likely value so that we can drop this to one case instead of having to
have two. Same thing for the macros below.

> +/**
> + * netif_tx_queue_maybe_stop() - locklessly stop a Tx queue, if needed
> + * @txq:	struct netdev_queue to stop/start
> + * @get_desc:	get current number of free descriptors (see requirements below!)
> + * @stop_thrs:	minimal number of available descriptors for queue to be left
> + *		enabled
> + * @start_thrs:	minimal number of descriptors to re-enable the queue, can be
> + *		equal to @stop_thrs or higher to avoid frequent waking
> + *
> + * All arguments may be evaluated multiple times, beware of side effects.
> + * @get_desc must be a formula or a function call, it must always
> + * return up-to-date information when evaluated!
> + * Expected to be used from ndo_start_xmit, see the comment on top of the file.
> + *
> + * Returns:
> + *	 0 if the queue was stopped
> + *	 1 if the queue was left enabled
> + *	-1 if the queue was re-enabled (raced with waking)
> + */

We may want to change the values here. The most likely case is "left
enabled" with that being the case we probably want to make that the 0
case. I would then probably make 1 the re-enabled case and -1 the
stopped case.

With that the decision tree becomes more straightforward as we would do
something like:
	if (result) {
		if (result < 0)
			Increment stopped stat
			return
		else
			Increment restarted stat
	}

In addition for readability we may want consider adding an enum simliar
to the netdev_tx enum as then we have the return types locked and
usable should we want to specifically pick out one.


> +#define netif_tx_queue_maybe_stop(txq, get_desc, stop_thrs, start_thrs)	\
> +	({								\
> +		int _res;						\
> +									\
> +		if (likely(get_desc > stop_thrs))			\
> +			_res = 1;					\
> +		else							\
> +			_res = netif_tx_queue_try_stop(txq, get_desc,	\
> +						       start_thrs);	\
> +		_res;							\
> +	})								\
> +
> +#define __netif_tx_queue_try_wake(txq, get_desc, start_thrs, down_cond) \
> +	({								\
> +		int _res;						\
> +									\
> +		/* Make sure that anybody stopping the queue after	\
> +		 * this sees the new next_to_clean.			\
> +		 */							\
> +		smp_mb();						\
> +		if (netif_tx_queue_stopped(txq) && !(down_cond)) {	\
> +			netif_tx_wake_queue(txq);			\
> +			_res = 0;					\
> +		} else {						\
> +			_res = 1;					\
> +		}							\
> +		_res;							\
> +	})
> +
> +#define netif_tx_queue_try_wake(txq, get_desc, start_thrs)		\
> +	__netif_tx_queue_try_wake(txq, get_desc, start_thrs, false)
> +
> +/**
> + * __netif_tx_queue_maybe_wake() - locklessly wake a Tx queue, if needed
> + * @txq:	struct netdev_queue to stop/start
> + * @get_desc:	get current number of free descriptors (see requirements below!)
> + * @start_thrs:	minimal number of descriptors to re-enable the queue
> + * @down_cond:	down condition, predicate indicating that the queue should
> + *		not be woken up even if descriptors are available
> + *
> + * All arguments may be evaluated multiple times.
> + * @get_desc must be a formula or a function call, it must always
> + * return up-to-date information when evaluated!
> + *
> + * Returns:
> + *	 0 if the queue was woken up
> + *	 1 if the queue was already enabled (or disabled but @down_cond is true)
> + *	-1 if the queue was left stopped
> + */

I would go with the same here. The most common case should probably be
0 which would be "already enabled" with 1 being woken up and -1 being
stopped. In addition keeping the two consistent with each other would
allow for easier understanding of the two.

> +#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
> +	({								\
> +		int _res;						\
> +									\
> +		if (likely(get_desc < start_thrs))			\
> +			_res = -1;					\
> +		else							\
> +			_res = __netif_tx_queue_try_wake(txq, get_desc,	\
> +							 start_thrs,	\
> +							 down_cond);	\
> +		_res;							\
> +	})
> +

The likely here is probably not correct. In most cases the queue will
likely have enough descriptors to enable waking since Tx cleanup can
usually run pretty fast compared to the transmit path itself since it
can run without needing to take locks.

> +#define netif_tx_queue_maybe_wake(txq, get_desc, start_thrs)		\
> +	__netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, false)
> +
> +/* subqueue variants follow */
> +
> +#define netif_subqueue_try_stop(dev, idx, get_desc, start_thrs)		\
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		netif_tx_queue_try_stop(txq, get_desc, start_thrs);	\
> +	})
> +
> +#define netif_subqueue_maybe_stop(dev, idx, get_desc, stop_thrs, start_thrs) \
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		netif_tx_queue_maybe_stop(txq, get_desc,		\
> +					  stop_thrs, start_thrs);	\
> +	})
> +
> +#define __netif_subqueue_try_wake(dev, idx, get_desc, start_thrs, down_cond) \
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		__netif_tx_queue_try_wake(txq, get_desc,		\
> +					  start_thrs, down_cond);	\
> +	})
> +
> +#define netif_subqueue_try_wake(dev, idx, get_desc, start_thrs)		\
> +	__netif_subqueue_try_wake(dev, idx, get_desc, start_thrs, false)
> +
> +#define __netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs, down_cond) \
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		__netif_tx_queue_maybe_wake(txq, get_desc,		\
> +					    start_thrs, down_cond);	\
> +	})
> +
> +#define netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs)	\
> +	__netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs, false)
> +
> +#endif
Andrew Lunn March 23, 2023, 9:02 p.m. UTC | #8
On Wed, Mar 22, 2023 at 06:04:06PM -0700, Jakub Kicinski wrote:
> CC: maintainers, in case there isn't a repost
> https://lore.kernel.org/all/20230322233028.269410-1-kuba@kernel.org/
> 
> On Thu, 23 Mar 2023 01:35:34 +0100 Andrew Lunn wrote:
> > On Wed, Mar 22, 2023 at 04:30:26PM -0700, Jakub Kicinski wrote:
> > > A lot of drivers follow the same scheme to stop / start queues
> > > without introducing locks between xmit and NAPI tx completions.
> > > I'm guessing they all copy'n'paste each other's code.
> > >
> > > Smaller drivers shy away from the scheme and introduce a lock
> > > which may cause deadlocks in netpoll.  
> > 
> > I notice there is no patch 0/X. Seems like the above would be good
> > material for it, along with a comment that a few drivers are converted
> > to make use of the new macros.
> 
> Then do I repeat the same text in the commit? Or cut the commit down?
> Doesn't that just take away information from the commit which will
> show up in git blame?
> 
> Having a cover letter is a good default, and required if the series 
> is a larger change decomposed into steps. But here there is a major
> change and a bunch of loose conversions. More sample users than
> meaningful part.
> 
> LMK what your preference for splitting this info is, I'm unsure.

We do seem to have a policy of asking for a 0/X. And it is used for
the merge commit. That is my real point. And i don't see why the text
can be repeated in the merge commit and the individual commits.

    Andrew
Jakub Kicinski March 23, 2023, 10:46 p.m. UTC | #9
On Thu, 23 Mar 2023 22:02:29 +0100 Andrew Lunn wrote:
> We do seem to have a policy of asking for a 0/X. And it is used for
> the merge commit. That is my real point. And i don't see why the text
> can be repeated in the merge commit and the individual commits.

Alright, I'll respin and add a cover. But I think we should be critical
of the rules. The rules are just a mental shortcut, we shouldn't lose
sight of why they are in place and maintain some flexibility.
Jakub Kicinski March 24, 2023, 3:09 a.m. UTC | #10
On Thu, 23 Mar 2023 09:05:35 -0700 Alexander H Duyck wrote:
> > +#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)		\
> > +	({								\
> > +		int _res;						\
> > +									\
> > +		netif_tx_stop_queue(txq);				\
> > +									\
> > +		smp_mb();						\
> > +									\
> > +		/* We need to check again in a case another		\
> > +		 * CPU has just made room available.			\
> > +		 */							\
> > +		if (likely(get_desc < start_thrs)) {			\
> > +			_res = 0;					\
> > +		} else {						\
> > +			netif_tx_start_queue(txq);			\
> > +			_res = -1;					\
> > +		}							\
> > +		_res;							\
> > +	})								\
> > +  
> 
> The issue I see here is that with this being a macro it abstracts away
> the relationship between get_desc and the memory barrier. At a minimum
> I think we should be treating get_desc as a function instead of just
> passing it and its arguments as a single value. Maybe something more
> like how read_poll_timeout passes the "op" and then uses it as a
> function with args passed seperately. What we want to avoid is passing
> a precomuted value to this function as get_desc.

The kdocs hopefully have enough warnings. The issue I see with
read_poll_timeout() is that I always have to have the definition 
open side by side to match up the arguments. I wish there was 
a way the test that something is not an lval, but I couldn't
find it :(

Let's see if anyone gets this wrong, you can tell me "I told you so"?

> In addition I think I would prefer to see _res initialized to the
> likely value so that we can drop this to one case instead of having to
> have two. Same thing for the macros below.

Alright.

> > +/**
> > + * netif_tx_queue_maybe_stop() - locklessly stop a Tx queue, if needed
> > + * @txq:	struct netdev_queue to stop/start
> > + * @get_desc:	get current number of free descriptors (see requirements below!)
> > + * @stop_thrs:	minimal number of available descriptors for queue to be left
> > + *		enabled
> > + * @start_thrs:	minimal number of descriptors to re-enable the queue, can be
> > + *		equal to @stop_thrs or higher to avoid frequent waking
> > + *
> > + * All arguments may be evaluated multiple times, beware of side effects.
> > + * @get_desc must be a formula or a function call, it must always
> > + * return up-to-date information when evaluated!
> > + * Expected to be used from ndo_start_xmit, see the comment on top of the file.
> > + *
> > + * Returns:
> > + *	 0 if the queue was stopped
> > + *	 1 if the queue was left enabled
> > + *	-1 if the queue was re-enabled (raced with waking)
> > + */  
> 
> We may want to change the values here. The most likely case is "left
> enabled" with that being the case we probably want to make that the 0
> case. I would then probably make 1 the re-enabled case and -1 the
> stopped case.

I chose the return values this way because the important information is
whether the queue was in fact stopped (in case the macro is used at the
start of .xmit as a safety check). If stopped is zero caller can check
!ret vs !!ret.

Seems pretty normal for the kernel function called stop() to return 0
if it did stop.

> With that the decision tree becomes more straightforward as we would do
> something like:
> 	if (result) {
> 		if (result < 0)
> 			Increment stopped stat
> 			return
> 		else
> 			Increment restarted stat
> 	}

Do you see a driver where it matters? ixgbe and co. use
netif_tx_queue_try_stop() and again they just test stopped vs not stopped.

> In addition for readability we may want consider adding an enum simliar
> to the netdev_tx enum as then we have the return types locked and
> usable should we want to specifically pick out one.

Hm, I thought people generally dislike the netdev_tx enum.
Maybe it's just me.

> > +#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
> > +	({								\
> > +		int _res;						\
> > +									\
> > +		if (likely(get_desc < start_thrs))			\
> > +			_res = -1;					\
> > +		else							\
> > +			_res = __netif_tx_queue_try_wake(txq, get_desc,	\
> > +							 start_thrs,	\
> > +							 down_cond);	\
> > +		_res;							\
> > +	})
> > +  
> 
> The likely here is probably not correct. In most cases the queue will
> likely have enough descriptors to enable waking since Tx cleanup can
> usually run pretty fast compared to the transmit path itself since it
> can run without needing to take locks.

Good catch, must be a copy paste from the other direction without
inverting the likely.
Alexander H Duyck March 24, 2023, 3:45 p.m. UTC | #11
On Thu, Mar 23, 2023 at 8:09 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 23 Mar 2023 09:05:35 -0700 Alexander H Duyck wrote:
> > > +#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)         \
> > > +   ({                                                              \
> > > +           int _res;                                               \
> > > +                                                                   \
> > > +           netif_tx_stop_queue(txq);                               \
> > > +                                                                   \
> > > +           smp_mb();                                               \
> > > +                                                                   \
> > > +           /* We need to check again in a case another             \
> > > +            * CPU has just made room available.                    \
> > > +            */                                                     \
> > > +           if (likely(get_desc < start_thrs)) {                    \
> > > +                   _res = 0;                                       \
> > > +           } else {                                                \
> > > +                   netif_tx_start_queue(txq);                      \
> > > +                   _res = -1;                                      \
> > > +           }                                                       \
> > > +           _res;                                                   \
> > > +   })                                                              \
> > > +
> >
> > The issue I see here is that with this being a macro it abstracts away
> > the relationship between get_desc and the memory barrier. At a minimum
> > I think we should be treating get_desc as a function instead of just
> > passing it and its arguments as a single value. Maybe something more
> > like how read_poll_timeout passes the "op" and then uses it as a
> > function with args passed seperately. What we want to avoid is passing
> > a precomuted value to this function as get_desc.
>
> The kdocs hopefully have enough warnings. The issue I see with
> read_poll_timeout() is that I always have to have the definition
> open side by side to match up the arguments. I wish there was
> a way the test that something is not an lval, but I couldn't
> find it :(
>
> Let's see if anyone gets this wrong, you can tell me "I told you so"?

The setup for it makes me really uncomfortable. Passing a function w/
arguments as an argument itself usually implies that it is called
first, not during.

> > In addition I think I would prefer to see _res initialized to the
> > likely value so that we can drop this to one case instead of having to
> > have two. Same thing for the macros below.
>
> Alright.
>
> > > +/**
> > > + * netif_tx_queue_maybe_stop() - locklessly stop a Tx queue, if needed
> > > + * @txq:   struct netdev_queue to stop/start
> > > + * @get_desc:      get current number of free descriptors (see requirements below!)
> > > + * @stop_thrs:     minimal number of available descriptors for queue to be left
> > > + *         enabled
> > > + * @start_thrs:    minimal number of descriptors to re-enable the queue, can be
> > > + *         equal to @stop_thrs or higher to avoid frequent waking
> > > + *
> > > + * All arguments may be evaluated multiple times, beware of side effects.
> > > + * @get_desc must be a formula or a function call, it must always
> > > + * return up-to-date information when evaluated!
> > > + * Expected to be used from ndo_start_xmit, see the comment on top of the file.
> > > + *
> > > + * Returns:
> > > + *  0 if the queue was stopped
> > > + *  1 if the queue was left enabled
> > > + * -1 if the queue was re-enabled (raced with waking)
> > > + */
> >
> > We may want to change the values here. The most likely case is "left
> > enabled" with that being the case we probably want to make that the 0
> > case. I would then probably make 1 the re-enabled case and -1 the
> > stopped case.
>
> I chose the return values this way because the important information is
> whether the queue was in fact stopped (in case the macro is used at the
> start of .xmit as a safety check). If stopped is zero caller can check
> !ret vs !!ret.
>
> Seems pretty normal for the kernel function called stop() to return 0
> if it did stop.

Except this isn't "stop", this is "maybe stop". Maybe we should just
do away with the stop/wake messaging and go with something such as a
RTS/CTS type setup. Basically this function is acting as a RTS to
verify that we have room on the ring to place the frame. If we don't
we are stopped. The "wake" function is on what is essentially the
receiving end on the other side of the hardware after it has DMAed the
frames and is providing the CTS signal back.

> > With that the decision tree becomes more straightforward as we would do
> > something like:
> >       if (result) {
> >               if (result < 0)
> >                       Increment stopped stat
> >                       return
> >               else
> >                       Increment restarted stat
> >       }
>
> Do you see a driver where it matters? ixgbe and co. use
> netif_tx_queue_try_stop() and again they just test stopped vs not stopped.

The thing is in order to make this work for the ixgbe patch you didn't
use the maybe_stop instead you went with the try_stop. If you replaced
the ixgbe_maybe_stop_tx with your maybe stop would have to do
something such as the code above to make it work. That is what I am
getting at. From what I can tell the only real difference between
ixgbe_maybe_stop_tx and your maybe_stop is that you avoided having to
move the restart_queue stat increment out.

The general thought is I would prefer to keep it so that 0 is the
default most likely case in both where the queue is enabled and is
still enabled. By moving the "take action" items into the 1/-1 values
then it becomes much easier to sort them out with 1 being a stat
increment and -1 being an indication to stop transmitting and prep for
watchdog hang if we don't clear this in the next watchdog period.

Also in general it makes it easier to understand if these all work
with the same logic.

> > In addition for readability we may want consider adding an enum simliar
> > to the netdev_tx enum as then we have the return types locked and
> > usable should we want to specifically pick out one.
>
> Hm, I thought people generally dislike the netdev_tx enum.
> Maybe it's just me.

The thought I had with the enum is to more easily connect the outcomes
with the sources. It would also help to prevent any confusion on what
is what. Having the two stop/wake functions return different values is
a potential source for errors since 0/1 means different things in the
different functions. Basically since we have 3 possible outcomes using
the enum would make it very clear what the mapping is between the two.

> > > +#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
> > > +   ({                                                              \
> > > +           int _res;                                               \
> > > +                                                                   \
> > > +           if (likely(get_desc < start_thrs))                      \
> > > +                   _res = -1;                                      \
> > > +           else                                                    \
> > > +                   _res = __netif_tx_queue_try_wake(txq, get_desc, \
> > > +                                                    start_thrs,    \
> > > +                                                    down_cond);    \
> > > +           _res;                                                   \
> > > +   })
> > > +
> >
> > The likely here is probably not correct. In most cases the queue will
> > likely have enough descriptors to enable waking since Tx cleanup can
> > usually run pretty fast compared to the transmit path itself since it
> > can run without needing to take locks.
>
> Good catch, must be a copy paste from the other direction without
> inverting the likely.

Actually the original code had it there in ixgbe although the check
also included total_packets in it which implies that maybe the
assumption was that Tx cleanup wasn't occurring as often as Rx? Either
that or it was a carry-over and had a check for
__netif_subqueue_stopped included in there previously.
Jakub Kicinski March 24, 2023, 9:28 p.m. UTC | #12
On Fri, 24 Mar 2023 08:45:23 -0700 Alexander Duyck wrote:
> On Thu, Mar 23, 2023 at 8:09 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > We may want to change the values here. The most likely case is "left
> > > enabled" with that being the case we probably want to make that the 0
> > > case. I would then probably make 1 the re-enabled case and -1 the
> > > stopped case.  
> >
> > I chose the return values this way because the important information is
> > whether the queue was in fact stopped (in case the macro is used at the
> > start of .xmit as a safety check). If stopped is zero caller can check
> > !ret vs !!ret.
> >
> > Seems pretty normal for the kernel function called stop() to return 0
> > if it did stop.  
> 
> Except this isn't "stop", this is "maybe stop".

So the return value from try_stop and maybe_stop would be different?
try_stop needs to return 0 if it stopped - the same semantics as
trylock(), AFAIR. Not that I love those semantics, but it's a fairly
strong precedent.

> Maybe we should just
> do away with the stop/wake messaging and go with something such as a
> RTS/CTS type setup. Basically this function is acting as a RTS to
> verify that we have room on the ring to place the frame. If we don't
> we are stopped. The "wake" function is on what is essentially the
> receiving end on the other side of the hardware after it has DMAed the
> frames and is providing the CTS signal back.

I'm definitely open to different naming but wouldn't calling RTS _after_
send be even more confusing than maybe_stop?

> > > With that the decision tree becomes more straightforward as we would do
> > > something like:
> > >       if (result) {
> > >               if (result < 0)
> > >                       Increment stopped stat
> > >                       return
> > >               else
> > >                       Increment restarted stat
> > >       }  
> >
> > Do you see a driver where it matters? ixgbe and co. use
> > netif_tx_queue_try_stop() and again they just test stopped vs not stopped.  
> 
> The thing is in order to make this work for the ixgbe patch you didn't
> use the maybe_stop instead you went with the try_stop. If you replaced
> the ixgbe_maybe_stop_tx with your maybe stop would have to do
> something such as the code above to make it work. That is what I am
> getting at. From what I can tell the only real difference between
> ixgbe_maybe_stop_tx and your maybe_stop is that you avoided having to
> move the restart_queue stat increment out.

I can convert ixgbe further, true, but I needed the try_stop, anyway,
because bnxt does:

if (/* need to stop */) {
	if (xmit_more())
		flush_db_write();
	netif_tx_queue_try_stop();
}

which seems reasonable.

> The general thought is I would prefer to keep it so that 0 is the
> default most likely case in both where the queue is enabled and is
> still enabled. By moving the "take action" items into the 1/-1 values
> then it becomes much easier to sort them out with 1 being a stat
> increment and -1 being an indication to stop transmitting and prep for
> watchdog hang if we don't clear this in the next watchdog period.

Maybe worth taking a step back - the restart stat which ixgbe
maintains made perfect sense when you pioneered this approach but
I think we had a decade of use, and have kprobes now, so we don't
really need to maintain a statistic for a condition with no impact 
to the user? New driver should not care 1 vs -1..

> Also in general it makes it easier to understand if these all work
> with the same logic.
> 
> > > In addition for readability we may want consider adding an enum simliar
> > > to the netdev_tx enum as then we have the return types locked and
> > > usable should we want to specifically pick out one.  
> >
> > Hm, I thought people generally dislike the netdev_tx enum.
> > Maybe it's just me.  
> 
> The thought I had with the enum is to more easily connect the outcomes
> with the sources. It would also help to prevent any confusion on what
> is what. Having the two stop/wake functions return different values is
> a potential source for errors since 0/1 means different things in the
> different functions. Basically since we have 3 possible outcomes using
> the enum would make it very clear what the mapping is between the two.

IMO only two outcomes matter in practice (as mentioned above).
I really like the ability to treat the return value as a bool, if only
we had negative zero we would have a perfect compromise :(
Alexander H Duyck March 26, 2023, 9:23 p.m. UTC | #13
On Fri, Mar 24, 2023 at 2:28 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 24 Mar 2023 08:45:23 -0700 Alexander Duyck wrote:
> > On Thu, Mar 23, 2023 at 8:09 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > We may want to change the values here. The most likely case is "left
> > > > enabled" with that being the case we probably want to make that the 0
> > > > case. I would then probably make 1 the re-enabled case and -1 the
> > > > stopped case.
> > >
> > > I chose the return values this way because the important information is
> > > whether the queue was in fact stopped (in case the macro is used at the
> > > start of .xmit as a safety check). If stopped is zero caller can check
> > > !ret vs !!ret.
> > >
> > > Seems pretty normal for the kernel function called stop() to return 0
> > > if it did stop.
> >
> > Except this isn't "stop", this is "maybe stop".
>
> So the return value from try_stop and maybe_stop would be different?
> try_stop needs to return 0 if it stopped - the same semantics as
> trylock(), AFAIR. Not that I love those semantics, but it's a fairly
> strong precedent.

The problem is this isn't a lock. Ideally with this we aren't taking
the action. So if anything this functions in my mind more like the
inverse where if this does stop we have to abort more like trylock
failing.

This is why I mentioned that maybe this should be renamed. I view this
more as a check to verify we are good to proceed. In addition there is
the problem that there are 3 possible outcomes with maybe_stop versus
the two from try_stop.

> > Maybe we should just
> > do away with the stop/wake messaging and go with something such as a
> > RTS/CTS type setup. Basically this function is acting as a RTS to
> > verify that we have room on the ring to place the frame. If we don't
> > we are stopped. The "wake" function is on what is essentially the
> > receiving end on the other side of the hardware after it has DMAed the
> > frames and is providing the CTS signal back.
>
> I'm definitely open to different naming but wouldn't calling RTS _after_
> send be even more confusing than maybe_stop?

We don't call maybe_stop after the send. For that we would be calling
try_stop. The difference being in the case of maybe_stop we might have
to return busy.

> > > > With that the decision tree becomes more straightforward as we would do
> > > > something like:
> > > >       if (result) {
> > > >               if (result < 0)
> > > >                       Increment stopped stat
> > > >                       return
> > > >               else
> > > >                       Increment restarted stat
> > > >       }
> > >
> > > Do you see a driver where it matters? ixgbe and co. use
> > > netif_tx_queue_try_stop() and again they just test stopped vs not stopped.
> >
> > The thing is in order to make this work for the ixgbe patch you didn't
> > use the maybe_stop instead you went with the try_stop. If you replaced
> > the ixgbe_maybe_stop_tx with your maybe stop would have to do
> > something such as the code above to make it work. That is what I am
> > getting at. From what I can tell the only real difference between
> > ixgbe_maybe_stop_tx and your maybe_stop is that you avoided having to
> > move the restart_queue stat increment out.
>
> I can convert ixgbe further, true, but I needed the try_stop, anyway,
> because bnxt does:
>
> if (/* need to stop */) {
>         if (xmit_more())
>                 flush_db_write();
>         netif_tx_queue_try_stop();
> }
>
> which seems reasonable.

I wasn't saying we didn't need try_stop. However the logic here
doesn't care about the return value. In the ixgbe case we track the
queue restarts so we would want a 0 on success and a non-zero if we
have to increment the stat. I would be okay with the 0 (success) / -1
(queue restarted) in this case.

> > The general thought is I would prefer to keep it so that 0 is the
> > default most likely case in both where the queue is enabled and is
> > still enabled. By moving the "take action" items into the 1/-1 values
> > then it becomes much easier to sort them out with 1 being a stat
> > increment and -1 being an indication to stop transmitting and prep for
> > watchdog hang if we don't clear this in the next watchdog period.
>
> Maybe worth taking a step back - the restart stat which ixgbe
> maintains made perfect sense when you pioneered this approach but
> I think we had a decade of use, and have kprobes now, so we don't
> really need to maintain a statistic for a condition with no impact
> to the user? New driver should not care 1 vs -1..

Actually the restart_queue stat is VERY useful for debugging. It tells
us we are seeing backlogs develop in the Tx queue. We track it any
time we wake up the queue, not just in the maybe_stop case.

WIthout that we are then having to break out kprobes and the like
which we could only add after-the-fact which makes things much harder
to debug when issues occur. For example, a common case to use it is to
monitor it when we see a system with slow Tx connections. With that
stat we can tell if we are building a backlog in the qdisc or if it is
something else such as a limited amount of socket memory is limiting
the transmits.

> > Also in general it makes it easier to understand if these all work
> > with the same logic.
> >
> > > > In addition for readability we may want consider adding an enum simliar
> > > > to the netdev_tx enum as then we have the return types locked and
> > > > usable should we want to specifically pick out one.
> > >
> > > Hm, I thought people generally dislike the netdev_tx enum.
> > > Maybe it's just me.
> >
> > The thought I had with the enum is to more easily connect the outcomes
> > with the sources. It would also help to prevent any confusion on what
> > is what. Having the two stop/wake functions return different values is
> > a potential source for errors since 0/1 means different things in the
> > different functions. Basically since we have 3 possible outcomes using
> > the enum would make it very clear what the mapping is between the two.
>
> IMO only two outcomes matter in practice (as mentioned above).
> I really like the ability to treat the return value as a bool, if only
> we had negative zero we would have a perfect compromise :(

I think we are just thinking about two different things. I am focusing
on the "maybe" calls that have 3 outcomes whereas I think you are
mostly focused on the "try" calls. My thought is to treat it something
like the msix allocation calls where a negative indicates a failure
forcing us to stop since the ring is full, 0 is a success, and a value
indicates that there are resources but they are/were limited.
Jakub Kicinski March 29, 2023, 12:56 a.m. UTC | #14
On Sun, 26 Mar 2023 14:23:07 -0700 Alexander Duyck wrote:
> > > Except this isn't "stop", this is "maybe stop".  
> >
> > So the return value from try_stop and maybe_stop would be different?
> > try_stop needs to return 0 if it stopped - the same semantics as
> > trylock(), AFAIR. Not that I love those semantics, but it's a fairly
> > strong precedent.  
> 
> The problem is this isn't a lock. Ideally with this we aren't taking
> the action. So if anything this functions in my mind more like the
> inverse where if this does stop we have to abort more like trylock
> failing.

No.. for try_stop we are trying to stop.

> This is why I mentioned that maybe this should be renamed. I view this
> more as a check to verify we are good to proceed. In addition there is
> the problem that there are 3 possible outcomes with maybe_stop versus
> the two from try_stop.

I'm open to other names :S

> > > The thing is in order to make this work for the ixgbe patch you didn't
> > > use the maybe_stop instead you went with the try_stop. If you replaced
> > > the ixgbe_maybe_stop_tx with your maybe stop would have to do
> > > something such as the code above to make it work. That is what I am
> > > getting at. From what I can tell the only real difference between
> > > ixgbe_maybe_stop_tx and your maybe_stop is that you avoided having to
> > > move the restart_queue stat increment out.  
> >
> > I can convert ixgbe further, true, but I needed the try_stop, anyway,
> > because bnxt does:
> >
> > if (/* need to stop */) {
> >         if (xmit_more())
> >                 flush_db_write();
> >         netif_tx_queue_try_stop();
> > }
> >
> > which seems reasonable.  
> 
> I wasn't saying we didn't need try_stop. However the logic here
> doesn't care about the return value. In the ixgbe case we track the
> queue restarts so we would want a 0 on success and a non-zero if we
> have to increment the stat. I would be okay with the 0 (success) / -1
> (queue restarted) in this case.
> 
> > > The general thought is I would prefer to keep it so that 0 is the
> > > default most likely case in both where the queue is enabled and is
> > > still enabled. By moving the "take action" items into the 1/-1 values
> > > then it becomes much easier to sort them out with 1 being a stat
> > > increment and -1 being an indication to stop transmitting and prep for
> > > watchdog hang if we don't clear this in the next watchdog period.  
> >
> > Maybe worth taking a step back - the restart stat which ixgbe
> > maintains made perfect sense when you pioneered this approach but
> > I think we had a decade of use, and have kprobes now, so we don't
> > really need to maintain a statistic for a condition with no impact
> > to the user? New driver should not care 1 vs -1..  
> 
> Actually the restart_queue stat is VERY useful for debugging. It tells
> us we are seeing backlogs develop in the Tx queue. We track it any
> time we wake up the queue, not just in the maybe_stop case.
> 
> WIthout that we are then having to break out kprobes and the like
> which we could only add after-the-fact which makes things much harder
> to debug when issues occur. For example, a common case to use it is to
> monitor it when we see a system with slow Tx connections. With that
> stat we can tell if we are building a backlog in the qdisc or if it is
> something else such as a limited amount of socket memory is limiting
> the transmits.

Oh, I missed that wake uses the same stat. Let me clarify - the
stop/start counter is definitely useful. What I thought the restart
counter is counting is just the race cases. I don't think the race
cases are worth counting in any way.

> > > The thought I had with the enum is to more easily connect the outcomes
> > > with the sources. It would also help to prevent any confusion on what
> > > is what. Having the two stop/wake functions return different values is
> > > a potential source for errors since 0/1 means different things in the
> > > different functions. Basically since we have 3 possible outcomes using
> > > the enum would make it very clear what the mapping is between the two.  
> >
> > IMO only two outcomes matter in practice (as mentioned above).
> > I really like the ability to treat the return value as a bool, if only
> > we had negative zero we would have a perfect compromise :(  
> 
> I think we are just thinking about two different things. I am focusing
> on the "maybe" calls that have 3 outcomes whereas I think you are
> mostly focused on the "try" calls. My thought is to treat it something
> like the msix allocation calls where a negative indicates a failure
> forcing us to stop since the ring is full, 0 is a success, and a value
> indicates that there are resources but they are/were limited.

I don't see a strong analogy to PCI resource allocation :(

I prefer to keep the 0 vs non-0 distinction to indicate whether 
the action was performed.

Paolo, Eric, any opinion? Other than the one likely vs unlikely
flip -- is this good enough to merge for you?
Paolo Abeni March 30, 2023, 2:56 p.m. UTC | #15
Hi,

On Tue, 2023-03-28 at 17:56 -0700, Jakub Kicinski wrote:
> On Sun, 26 Mar 2023 14:23:07 -0700 Alexander Duyck wrote:
> > > > Except this isn't "stop", this is "maybe stop".  
> > > 
> > > So the return value from try_stop and maybe_stop would be different?
> > > try_stop needs to return 0 if it stopped - the same semantics as
> > > trylock(), AFAIR. Not that I love those semantics, but it's a fairly
> > > strong precedent.  
> > 
> > The problem is this isn't a lock. Ideally with this we aren't taking
> > the action. So if anything this functions in my mind more like the
> > inverse where if this does stop we have to abort more like trylock
> > failing.
> 
> No.. for try_stop we are trying to stop.
> 
> > This is why I mentioned that maybe this should be renamed. I view this
> > more as a check to verify we are good to proceed. In addition there is
> > the problem that there are 3 possible outcomes with maybe_stop versus
> > the two from try_stop.
> 
> I'm open to other names :S
> 
> > > > The thing is in order to make this work for the ixgbe patch you didn't
> > > > use the maybe_stop instead you went with the try_stop. If you replaced
> > > > the ixgbe_maybe_stop_tx with your maybe stop would have to do
> > > > something such as the code above to make it work. That is what I am
> > > > getting at. From what I can tell the only real difference between
> > > > ixgbe_maybe_stop_tx and your maybe_stop is that you avoided having to
> > > > move the restart_queue stat increment out.  
> > > 
> > > I can convert ixgbe further, true, but I needed the try_stop, anyway,
> > > because bnxt does:
> > > 
> > > if (/* need to stop */) {
> > >         if (xmit_more())
> > >                 flush_db_write();
> > >         netif_tx_queue_try_stop();
> > > }
> > > 
> > > which seems reasonable.  
> > 
> > I wasn't saying we didn't need try_stop. However the logic here
> > doesn't care about the return value. In the ixgbe case we track the
> > queue restarts so we would want a 0 on success and a non-zero if we
> > have to increment the stat. I would be okay with the 0 (success) / -1
> > (queue restarted) in this case.
> > 
> > > > The general thought is I would prefer to keep it so that 0 is the
> > > > default most likely case in both where the queue is enabled and is
> > > > still enabled. By moving the "take action" items into the 1/-1 values
> > > > then it becomes much easier to sort them out with 1 being a stat
> > > > increment and -1 being an indication to stop transmitting and prep for
> > > > watchdog hang if we don't clear this in the next watchdog period.  
> > > 
> > > Maybe worth taking a step back - the restart stat which ixgbe
> > > maintains made perfect sense when you pioneered this approach but
> > > I think we had a decade of use, and have kprobes now, so we don't
> > > really need to maintain a statistic for a condition with no impact
> > > to the user? New driver should not care 1 vs -1..  
> > 
> > Actually the restart_queue stat is VERY useful for debugging. It tells
> > us we are seeing backlogs develop in the Tx queue. We track it any
> > time we wake up the queue, not just in the maybe_stop case.
> > 
> > WIthout that we are then having to break out kprobes and the like
> > which we could only add after-the-fact which makes things much harder
> > to debug when issues occur. For example, a common case to use it is to
> > monitor it when we see a system with slow Tx connections. With that
> > stat we can tell if we are building a backlog in the qdisc or if it is
> > something else such as a limited amount of socket memory is limiting
> > the transmits.
> 
> Oh, I missed that wake uses the same stat. Let me clarify - the
> stop/start counter is definitely useful. What I thought the restart
> counter is counting is just the race cases. I don't think the race
> cases are worth counting in any way.
> 
> > > > The thought I had with the enum is to more easily connect the outcomes
> > > > with the sources. It would also help to prevent any confusion on what
> > > > is what. Having the two stop/wake functions return different values is
> > > > a potential source for errors since 0/1 means different things in the
> > > > different functions. Basically since we have 3 possible outcomes using
> > > > the enum would make it very clear what the mapping is between the two.  
> > > 
> > > IMO only two outcomes matter in practice (as mentioned above).
> > > I really like the ability to treat the return value as a bool, if only
> > > we had negative zero we would have a perfect compromise :(  
> > 
> > I think we are just thinking about two different things. I am focusing
> > on the "maybe" calls that have 3 outcomes whereas I think you are
> > mostly focused on the "try" calls. My thought is to treat it something
> > like the msix allocation calls where a negative indicates a failure
> > forcing us to stop since the ring is full, 0 is a success, and a value
> > indicates that there are resources but they are/were limited.
> 
> I don't see a strong analogy to PCI resource allocation :(
> 
> I prefer to keep the 0 vs non-0 distinction to indicate whether 
> the action was performed.
> 
> Paolo, Eric, any opinion? Other than the one likely vs unlikely
> flip -- is this good enough to merge for you?

As you know I'm usually horrible at name related choice, but you asked,
so...

I'm personally ok with the current naming, and AFAICS the coding style
guidelines suggest returning 0 when imperative functions complete
successfully. 

I think we should apply the guidelines here, even if are talking about
macros.

That means netif_tx_queue_maybe_stop() and netif_tx_queue_try_stop()
should return 0 when the queue is actually stopped.

I'm personally fine with the current implementation.

Cheers,

Paolo
diff mbox series

Patch

diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
new file mode 100644
index 000000000000..64e059647274
--- /dev/null
+++ b/include/net/netdev_queues.h
@@ -0,0 +1,171 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_NET_QUEUES_H
+#define _LINUX_NET_QUEUES_H
+
+#include <linux/netdevice.h>
+
+/* Lockless queue stopping / waking helpers.
+ *
+ * These macros are designed to safely implement stopping and waking
+ * netdev queues without full lock protection. We assume that there can
+ * be no concurrent stop attempts and no concurrent wake attempts.
+ * The try-stop should happen from the xmit handler*, while wake up
+ * should be triggered from NAPI poll context. The two may run
+ * concurrently (single producer, single consumer).
+ *
+ * All descriptor ring indexes (and other relevant shared state) must
+ * be updated before invoking the macros.
+ *
+ * * the try-stop side does not reschedule Tx (netif_tx_start_queue()
+ *   instead of netif_tx_wake_queue()) so uses outside of the xmit
+ *   handler may lead to bugs
+ */
+
+#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)		\
+	({								\
+		int _res;						\
+									\
+		netif_tx_stop_queue(txq);				\
+									\
+		smp_mb();						\
+									\
+		/* We need to check again in a case another		\
+		 * CPU has just made room available.			\
+		 */							\
+		if (likely(get_desc < start_thrs)) {			\
+			_res = 0;					\
+		} else {						\
+			netif_tx_start_queue(txq);			\
+			_res = -1;					\
+		}							\
+		_res;							\
+	})								\
+
+/**
+ * netif_tx_queue_maybe_stop() - locklessly stop a Tx queue, if needed
+ * @txq:	struct netdev_queue to stop/start
+ * @get_desc:	get current number of free descriptors (see requirements below!)
+ * @stop_thrs:	minimal number of available descriptors for queue to be left
+ *		enabled
+ * @start_thrs:	minimal number of descriptors to re-enable the queue, can be
+ *		equal to @stop_thrs or higher to avoid frequent waking
+ *
+ * All arguments may be evaluated multiple times, beware of side effects.
+ * @get_desc must be a formula or a function call, it must always
+ * return up-to-date information when evaluated!
+ * Expected to be used from ndo_start_xmit, see the comment on top of the file.
+ *
+ * Returns:
+ *	 0 if the queue was stopped
+ *	 1 if the queue was left enabled
+ *	-1 if the queue was re-enabled (raced with waking)
+ */
+#define netif_tx_queue_maybe_stop(txq, get_desc, stop_thrs, start_thrs)	\
+	({								\
+		int _res;						\
+									\
+		if (likely(get_desc > stop_thrs))			\
+			_res = 1;					\
+		else							\
+			_res = netif_tx_queue_try_stop(txq, get_desc,	\
+						       start_thrs);	\
+		_res;							\
+	})								\
+
+#define __netif_tx_queue_try_wake(txq, get_desc, start_thrs, down_cond) \
+	({								\
+		int _res;						\
+									\
+		/* Make sure that anybody stopping the queue after	\
+		 * this sees the new next_to_clean.			\
+		 */							\
+		smp_mb();						\
+		if (netif_tx_queue_stopped(txq) && !(down_cond)) {	\
+			netif_tx_wake_queue(txq);			\
+			_res = 0;					\
+		} else {						\
+			_res = 1;					\
+		}							\
+		_res;							\
+	})
+
+#define netif_tx_queue_try_wake(txq, get_desc, start_thrs)		\
+	__netif_tx_queue_try_wake(txq, get_desc, start_thrs, false)
+
+/**
+ * __netif_tx_queue_maybe_wake() - locklessly wake a Tx queue, if needed
+ * @txq:	struct netdev_queue to stop/start
+ * @get_desc:	get current number of free descriptors (see requirements below!)
+ * @start_thrs:	minimal number of descriptors to re-enable the queue
+ * @down_cond:	down condition, predicate indicating that the queue should
+ *		not be woken up even if descriptors are available
+ *
+ * All arguments may be evaluated multiple times.
+ * @get_desc must be a formula or a function call, it must always
+ * return up-to-date information when evaluated!
+ *
+ * Returns:
+ *	 0 if the queue was woken up
+ *	 1 if the queue was already enabled (or disabled but @down_cond is true)
+ *	-1 if the queue was left stopped
+ */
+#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
+	({								\
+		int _res;						\
+									\
+		if (likely(get_desc < start_thrs))			\
+			_res = -1;					\
+		else							\
+			_res = __netif_tx_queue_try_wake(txq, get_desc,	\
+							 start_thrs,	\
+							 down_cond);	\
+		_res;							\
+	})
+
+#define netif_tx_queue_maybe_wake(txq, get_desc, start_thrs)		\
+	__netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, false)
+
+/* subqueue variants follow */
+
+#define netif_subqueue_try_stop(dev, idx, get_desc, start_thrs)		\
+	({								\
+		struct netdev_queue *txq;				\
+									\
+		txq = netdev_get_tx_queue(dev, idx);			\
+		netif_tx_queue_try_stop(txq, get_desc, start_thrs);	\
+	})
+
+#define netif_subqueue_maybe_stop(dev, idx, get_desc, stop_thrs, start_thrs) \
+	({								\
+		struct netdev_queue *txq;				\
+									\
+		txq = netdev_get_tx_queue(dev, idx);			\
+		netif_tx_queue_maybe_stop(txq, get_desc,		\
+					  stop_thrs, start_thrs);	\
+	})
+
+#define __netif_subqueue_try_wake(dev, idx, get_desc, start_thrs, down_cond) \
+	({								\
+		struct netdev_queue *txq;				\
+									\
+		txq = netdev_get_tx_queue(dev, idx);			\
+		__netif_tx_queue_try_wake(txq, get_desc,		\
+					  start_thrs, down_cond);	\
+	})
+
+#define netif_subqueue_try_wake(dev, idx, get_desc, start_thrs)		\
+	__netif_subqueue_try_wake(dev, idx, get_desc, start_thrs, false)
+
+#define __netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs, down_cond) \
+	({								\
+		struct netdev_queue *txq;				\
+									\
+		txq = netdev_get_tx_queue(dev, idx);			\
+		__netif_tx_queue_maybe_wake(txq, get_desc,		\
+					    start_thrs, down_cond);	\
+	})
+
+#define netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs)	\
+	__netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs, false)
+
+#endif