
[net-next,v3] net: dqs: add NIC stall detector based on BQL

Message ID 20240202165315.2506384-1-leitao@debian.org (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series [net-next,v3] net: dqs: add NIC stall detector based on BQL

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5173 this patch: 5173
netdev/build_tools success Errors and warnings before: 1 this patch: 0
netdev/cc_maintainers success CCed 0 of 0 maintainers
netdev/build_clang success Errors and warnings before: 1193 this patch: 1193
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 5475 this patch: 5475
netdev/checkpatch fail CHECK: Alignment should match open parenthesis CHECK: Lines should not end with a '(' ERROR: space prohibited after that open parenthesis '('
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-02-04--21-00 (tests: 721)

Commit Message

Breno Leitao Feb. 2, 2024, 4:53 p.m. UTC
From: Jakub Kicinski <kuba@kernel.org>

softnet_data->time_squeeze is sometimes used as a proxy for
host overload or an indication of scheduling problems. In practice
this statistic is very noisy and has hard-to-grasp units -
e.g. is 10 squeezes a second to be expected, or high?

Delaying network (NAPI) processing leads to drops on NIC queues
but also RTT bloat, impacting pacing and CA decisions.
Stalls are a little hard to detect on the Rx side, because
there may simply have not been any packets received in a given
period of time. Packet timestamps help a little bit, but
again we don't know if packets are stale because we're
not keeping up or because someone (*cough* cgroups)
disabled IRQs for a long time.

We can, however, use Tx as a proxy for Rx stalls. Most drivers
use combined Rx+Tx NAPIs so if Tx gets starved so will Rx.
On the Tx side we know exactly when packets get queued,
and completed, so there is no uncertainty.

This patch adds stall checks to BQL. Why BQL? Because
it's a convenient place to add such checks, already
called by most drivers, and it has copious free space
in its structures (this patch adds no extra cache
references or dirtying to the fast path).

The algorithm takes one parameter - the max delay, AKA the
stall threshold - and increments a counter whenever NAPI gets
delayed for at least that amount of time. It also records the
length of the longest stall.

To be precise, every time NAPI has not polled for at least
stall_thrs, we check whether any Tx packets were queued
between the last NAPI run and now - stall_thrs/2.

Unlike the classic Tx watchdog, this mechanism does not
ignore stalls caused by Tx being disabled or loss of link.
I don't think such a check is worth the complexity; a
stall is a stall, whether due to host overload, flow
control, or link down - it doesn't matter much to the application.

We have been running this detector in production at Meta
for 2 years, with a threshold of 8 ms. It's the lowest
value at which false positives become rare. There's still
a constant stream of reported stalls (especially when the
ksoftirqd deferral patches are not reverted); those who like
their stall metrics to be 0 may prefer a higher value.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changelog:

v1:
  * https://lore.kernel.org/netdev/202306172057.jx7YhLiu-lkp@intel.com/T/

v2:
  * Fix the documentation file in patch 0001, since patch 0002 will
    touch it later.
  * Fix the kernel test robot issues, marking functions as static.
  * Use #include <linux/bitops.h> instead of <asm/bitops.h>.
  * Added some new comments, mainly around the barriers.
  * Format struct `netdev_queue_attribute` assignments to make
    checkpatch happy.
  * Updated and fixed the path in sysfs-class-net-queues
    documentation.

v3:
  * Sending patch 0002 against net-next.
	- The first patch was accepted against 'net'

---
 .../ABI/testing/sysfs-class-net-queues        | 23 +++++++
 include/linux/dynamic_queue_limits.h          | 35 +++++++++++
 include/trace/events/napi.h                   | 33 ++++++++++
 lib/dynamic_queue_limits.c                    | 58 +++++++++++++++++
 net/core/net-sysfs.c                          | 62 +++++++++++++++++++
 5 files changed, 211 insertions(+)
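
A quick illustration of the "now - stall_thrs/2" window described above,
as a stand-alone user-space sketch (an illustrative model assuming
HZ = 1000, i.e. 1 jiffy = 1 ms - not the kernel code itself):

#include <stdbool.h>
#include <stdio.h>

static bool stall_window(unsigned long now, unsigned long last_reap,
			 unsigned long stall_thrs,
			 unsigned long *start, unsigned long *end)
{
	if (now - last_reap < stall_thrs)
		return false;		/* not enough Tx-completion silence yet */
	*start = last_reap + 1;		/* oldest queue event of interest */
	*end = now - stall_thrs / 2;	/* newest event a completion is overdue for */
	return true;
}

int main(void)
{
	unsigned long start, end;

	/* last completion at jiffy 100, now at jiffy 110, stall_thrs = 8 */
	if (stall_window(110, 100, 8, &start, &end))
		printf("stall if anything was queued in [%lu, %lu]\n",
		       start, end);	/* prints [101, 106] */
	return 0;
}

Note that stall_thrs is stored in jiffies but read and written in
milliseconds via sysfs, so the 8 ms production value maps to 8 jiffies
at HZ=1000 but only 2 at HZ=250; since the setter below rejects values
under 4 jiffies, the smallest accepted threshold grows as HZ shrinks
(40 ms at HZ=100).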

Comments

Paolo Abeni Feb. 6, 2024, 11:40 a.m. UTC | #1
On Fri, 2024-02-02 at 08:53 -0800, Breno Leitao wrote:
> From: Jakub Kicinski <kuba@kernel.org>
> 
> softnet_data->time_squeeze is sometimes used as a proxy for
> host overload or an indication of scheduling problems. In practice
> this statistic is very noisy and has hard-to-grasp units -
> e.g. is 10 squeezes a second to be expected, or high?
> 
> Delaying network (NAPI) processing leads to drops on NIC queues
> but also RTT bloat, impacting pacing and CA decisions.
> Stalls are a little hard to detect on the Rx side, because
> there may simply have not been any packets received in a given
> period of time. Packet timestamps help a little bit, but
> again we don't know if packets are stale because we're
> not keeping up or because someone (*cough* cgroups)
> disabled IRQs for a long time.
> 
> We can, however, use Tx as a proxy for Rx stalls. Most drivers
> use combined Rx+Tx NAPIs so if Tx gets starved so will Rx.
> On the Tx side we know exactly when packets get queued,
> and completed, so there is no uncertainty.
> 
> This patch adds stall checks to BQL. Why BQL? Because
> it's a convenient place to add such checks, already
> called by most drivers, and it has copious free space
> in its structures (this patch adds no extra cache
> references or dirtying to the fast path).
> 
> The algorithm takes one parameter - the max delay, AKA the
> stall threshold - and increments a counter whenever NAPI gets
> delayed for at least that amount of time. It also records the
> length of the longest stall.
> 
> To be precise, every time NAPI has not polled for at least
> stall_thrs, we check whether any Tx packets were queued
> between the last NAPI run and now - stall_thrs/2.
> 
> Unlike the classic Tx watchdog, this mechanism does not
> ignore stalls caused by Tx being disabled or loss of link.
> I don't think such a check is worth the complexity; a
> stall is a stall, whether due to host overload, flow
> control, or link down - it doesn't matter much to the application.
> 
> We have been running this detector in production at Meta
> for 2 years, with a threshold of 8 ms. It's the lowest
> value at which false positives become rare. There's still
> a constant stream of reported stalls (especially when the
> ksoftirqd deferral patches are not reverted); those who like
> their stall metrics to be 0 may prefer a higher value.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> Changelog:
> 
> v1:
>   * https://lore.kernel.org/netdev/202306172057.jx7YhLiu-lkp@intel.com/T/
> 
> v2:
>   * Fix the documentation file in patch 0001, since patch 0002 will
>     touch it later.
>   * Fix the kernel test robot issues, marking functions as static.
>   * Use #include <linux/bitops.h> instead of <asm/bitops.h>.
>   * Added some new comments, mainly around the barriers.
>   * Format struct `netdev_queue_attribute` assignments to make
>     checkpatch happy.
>   * Updated and fixed the path in sysfs-class-net-queues
>     documentation.
> 
> v3:
>   * Sending patch 0002 against net-next.
> 	- The first patch was accepted against 'net'
> 
> ---
>  .../ABI/testing/sysfs-class-net-queues        | 23 +++++++
>  include/linux/dynamic_queue_limits.h          | 35 +++++++++++
>  include/trace/events/napi.h                   | 33 ++++++++++
>  lib/dynamic_queue_limits.c                    | 58 +++++++++++++++++
>  net/core/net-sysfs.c                          | 62 +++++++++++++++++++
>  5 files changed, 211 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-class-net-queues b/Documentation/ABI/testing/sysfs-class-net-queues
> index 5bff64d256c2..45bab9b6e3ae 100644
> --- a/Documentation/ABI/testing/sysfs-class-net-queues
> +++ b/Documentation/ABI/testing/sysfs-class-net-queues
> @@ -96,3 +96,26 @@ Description:
>  		Indicates the absolute minimum limit of bytes allowed to be
>  		queued on this network device transmit queue. Default value is
>  		0.
> +
> +What:		/sys/class/net/<iface>/queues/tx-<queue>/byte_queue_limits/stall_thrs
> +Date:		Jan 2024
> +KernelVersion:	6.9
> +Contact:	netdev@vger.kernel.org
> +Description:
> +		Tx completion stall detection threshold. 

Possibly worth mentioning it's in milliseconds

> Kernel will guarantee
> +		to detect all stalls longer than this threshold but may also
> +		detect stalls longer than half of the threshold.
> +
> +What:		/sys/class/net/<iface>/queues/tx-<queue>/byte_queue_limits/stall_cnt
> +Date:		Jan 2024
> +KernelVersion:	6.9
> +Contact:	netdev@vger.kernel.org
> +Description:
> +		Number of detected Tx completion stalls.
> +
> +What:		/sys/class/net/<iface>/queues/tx-<queue>/byte_queue_limits/stall_max
> +Date:		Jan 2024
> +KernelVersion:	6.9
> +Contact:	netdev@vger.kernel.org
> +Description:
> +		Longest detected Tx completion stall. Write 0 to clear.
> diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
> index 407c2f281b64..288e98fe85f0 100644
> --- a/include/linux/dynamic_queue_limits.h
> +++ b/include/linux/dynamic_queue_limits.h
> @@ -38,14 +38,21 @@
>  
>  #ifdef __KERNEL__
>  
> +#include <linux/bitops.h>
>  #include <asm/bug.h>
>  
> +#define DQL_HIST_LEN		4
> +#define DQL_HIST_ENT(dql, idx)	((dql)->history[(idx) % DQL_HIST_LEN])
> +
>  struct dql {
>  	/* Fields accessed in enqueue path (dql_queued) */
>  	unsigned int	num_queued;		/* Total ever queued */
>  	unsigned int	adj_limit;		/* limit + num_completed */
>  	unsigned int	last_obj_cnt;		/* Count at last queuing */
>  
> +	unsigned long	history_head;
> +	unsigned long	history[DQL_HIST_LEN];
> +
>  	/* Fields accessed only by completion path (dql_completed) */
>  
>  	unsigned int	limit ____cacheline_aligned_in_smp; /* Current limit */
> @@ -62,6 +69,11 @@ struct dql {
>  	unsigned int	max_limit;		/* Max limit */
>  	unsigned int	min_limit;		/* Minimum limit */
>  	unsigned int	slack_hold_time;	/* Time to measure slack */
> +
> +	unsigned char	stall_thrs;
> +	unsigned char	stall_max;

Why not 'unsigned short'?


> +	unsigned long	last_reap;
> +	unsigned long	stall_cnt;
>  };
>  
>  /* Set some static maximums */
> @@ -74,6 +86,8 @@ struct dql {
>   */
>  static inline void dql_queued(struct dql *dql, unsigned int count)
>  {
> +	unsigned long map, now, now_hi, i;
> +
>  	BUG_ON(count > DQL_MAX_OBJECT);
>  
>  	dql->last_obj_cnt = count;
> @@ -86,6 +100,27 @@ static inline void dql_queued(struct dql *dql, unsigned int count)
>  	barrier();
>  
>  	dql->num_queued += count;
> +
> +	now = jiffies;
> +	now_hi = now / BITS_PER_LONG;
> +	if (unlikely(now_hi != dql->history_head)) {
> +		/* About to reuse slots, clear them */
> +		for (i = 0; i < DQL_HIST_LEN; i++) {
> +			/* Multiplication masks high bits */
> +			if (now_hi * BITS_PER_LONG ==
> +			    (dql->history_head + i) * BITS_PER_LONG)
> +				break;
> +			DQL_HIST_ENT(dql, dql->history_head + i + 1) = 0;
> +		}
> +		/* pairs with smp_rmb() in dql_check_stall() */
> +		smp_wmb();
> +		WRITE_ONCE(dql->history_head, now_hi);
> +	}
> +
> +	/* __set_bit() does not guarantee WRITE_ONCE() semantics */
> +	map = DQL_HIST_ENT(dql, now_hi);
> +	if (!(map & BIT_MASK(now)))
> +		WRITE_ONCE(DQL_HIST_ENT(dql, now_hi), map | BIT_MASK(now));

Do you have a measure of the performance impact, if any? E.g. for udp
flood/pktgen/xdp tput tests with extreme pkt rate?

What about making all the above conditional on a non-zero stall_thrs,
like the check part?

>  }
>  
>  /* Returns how many objects can be queued, < 0 indicates over limit. */
> diff --git a/include/trace/events/napi.h b/include/trace/events/napi.h
> index 6678cf8b235b..272112ddaaa8 100644
> --- a/include/trace/events/napi.h
> +++ b/include/trace/events/napi.h
> @@ -36,6 +36,39 @@ TRACE_EVENT(napi_poll,
>  		  __entry->work, __entry->budget)
>  );
>  
> +TRACE_EVENT(dql_stall_detected,
> +
> +	TP_PROTO(unsigned int thrs, unsigned int len,
> +		 unsigned long last_reap, unsigned long hist_head,
> +		 unsigned long now, unsigned long *hist),
> +
> +	TP_ARGS(thrs, len, last_reap, hist_head, now, hist),
> +
> +	TP_STRUCT__entry(
> +		__field(	unsigned int,		thrs)
> +		__field(	unsigned int,		len)
> +		__field(	unsigned long,		last_reap)
> +		__field(	unsigned long,		hist_head)
> +		__field(	unsigned long,		now)
> +		__array(	unsigned long,		hist, 4)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->thrs = thrs;
> +		__entry->len = len;
> +		__entry->last_reap = last_reap;
> +		__entry->hist_head = hist_head * BITS_PER_LONG;
> +		__entry->now = now;
> +		memcpy(__entry->hist, hist, sizeof(entry->hist));
> +	),
> +
> +	TP_printk("thrs %u  len %u  last_reap %lu  hist_head %lu  now %lu  hist %016lx %016lx %016lx %016lx",
> +		  __entry->thrs, __entry->len,
> +		  __entry->last_reap, __entry->hist_head, __entry->now,
> +		  __entry->hist[0], __entry->hist[1],
> +		  __entry->hist[2], __entry->hist[3])
> +);
> +
>  #undef NO_DEV
>  
>  #endif /* _TRACE_NAPI_H */
> diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
> index fde0aa244148..162d30ae542c 100644
> --- a/lib/dynamic_queue_limits.c
> +++ b/lib/dynamic_queue_limits.c
> @@ -10,10 +10,61 @@
>  #include <linux/dynamic_queue_limits.h>
>  #include <linux/compiler.h>
>  #include <linux/export.h>
> +#include <trace/events/napi.h>
>  
>  #define POSDIFF(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0)
>  #define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)
>  
> +static void dql_check_stall(struct dql *dql)
> +{
> +	unsigned long stall_thrs, now;
> +
> +	/* If we detect a stall see if anything was queued */
> +	stall_thrs = READ_ONCE(dql->stall_thrs);
> +	if (!stall_thrs)
> +		return;
> +
> +	now = jiffies;
> +	if (time_after_eq(now, dql->last_reap + stall_thrs)) {
> +		unsigned long hist_head, t, start, end;
> +
> +		/* We are trying to detect a period of at least @stall_thrs
> +		 * jiffies without any Tx completions, but during first half
> +		 * of which some Tx was posted.
> +		 */
> +dqs_again:
> +		hist_head = READ_ONCE(dql->history_head);
> +		/* pairs with smp_wmb() in dql_queued() */
> +		smp_rmb();
> +		/* oldest sample since last reap */
> +		start = (hist_head - DQL_HIST_LEN + 1) * BITS_PER_LONG;
> +		if (time_before(start, dql->last_reap + 1))
> +			start = dql->last_reap + 1;
> +		/* newest sample we should have already seen a completion for */
> +		end = hist_head * BITS_PER_LONG + (BITS_PER_LONG - 1);
> +		if (time_before(now, end + stall_thrs / 2))
> +			end = now - stall_thrs / 2;
> +
> +		for (t = start; time_before_eq(t, end); t++)
> +			if (test_bit(t % (DQL_HIST_LEN * BITS_PER_LONG),
> +				     dql->history))
> +				break;
> +		if (!time_before_eq(t, end))
> +			goto no_stall;
> +		if (hist_head != READ_ONCE(dql->history_head))
> +			goto dqs_again;
> +
> +		dql->stall_cnt++;
> +		dql->stall_max = max_t(unsigned int, dql->stall_max, now - t);
> +
> +		trace_dql_stall_detected(dql->stall_thrs, now - t,
> +					 dql->last_reap, dql->history_head,
> +					 now, dql->history);
> +	}
> +no_stall:
> +	dql->last_reap = now;
> +}
> +
>  /* Records completed count and recalculates the queue limit */
>  void dql_completed(struct dql *dql, unsigned int count)
>  {
> @@ -110,6 +161,8 @@ void dql_completed(struct dql *dql, unsigned int count)
>  	dql->prev_last_obj_cnt = dql->last_obj_cnt;
>  	dql->num_completed = completed;
>  	dql->prev_num_queued = num_queued;
> +
> +	dql_check_stall(dql);
>  }
>  EXPORT_SYMBOL(dql_completed);
>  
> @@ -125,6 +178,10 @@ void dql_reset(struct dql *dql)
>  	dql->prev_ovlimit = 0;
>  	dql->lowest_slack = UINT_MAX;
>  	dql->slack_start_time = jiffies;
> +
> +	dql->last_reap = jiffies;
> +	dql->history_head = jiffies / BITS_PER_LONG;
> +	memset(dql->history, 0, sizeof(dql->history));
>  }
>  EXPORT_SYMBOL(dql_reset);
>  
> @@ -133,6 +190,7 @@ void dql_init(struct dql *dql, unsigned int hold_time)
>  	dql->max_limit = DQL_MAX_LIMIT;
>  	dql->min_limit = 0;
>  	dql->slack_hold_time = hold_time;
> +	dql->stall_thrs = 0;
>  	dql_reset(dql);
>  }
>  EXPORT_SYMBOL(dql_init);
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index a09d507c5b03..94a622b0bb55 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -1409,6 +1409,65 @@ static struct netdev_queue_attribute bql_hold_time_attribute __ro_after_init
>  	= __ATTR(hold_time, 0644,
>  		 bql_show_hold_time, bql_set_hold_time);
>  
> +static ssize_t bql_show_stall_thrs(struct netdev_queue *queue, char *buf)
> +{
> +	struct dql *dql = &queue->dql;
> +
> +	return sprintf(buf, "%u\n", jiffies_to_msecs(dql->stall_thrs));
> +}
> +
> +static ssize_t bql_set_stall_thrs(struct netdev_queue *queue,
> +				  const char *buf, size_t len)
> +{
> +	struct dql *dql = &queue->dql;
> +	unsigned int value;
> +	int err;
> +
> +	err = kstrtouint(buf, 10, &value);
> +	if (err < 0)
> +		return err;
> +
> +	value = msecs_to_jiffies(value);
> +	if (value && (value < 4 || value > 4 / 2 * BITS_PER_LONG))

I admit I'm more than a bit lost with the constant usage... why 4/2 here?
Perhaps some more comments describing the implemented logic would be
helpful.

Thanks!

Paolo
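
The smp_wmb()/smp_rmb() comments in the hunks quoted above follow the
usual publish/consume pairing. A runnable toy model of that protocol,
using C11 atomics in place of the kernel's WRITE_ONCE/smp_wmb (release
side) and READ_ONCE/smp_rmb (acquire side) - illustrative only, not the
kernel code:

#include <stdatomic.h>
#include <stdbool.h>

static _Atomic unsigned long history[4];
static _Atomic unsigned long history_head;

static void writer(unsigned long now_hi, unsigned long bit)
{
	/* update the slot first... */
	atomic_fetch_or_explicit(&history[now_hi % 4], 1UL << bit,
				 memory_order_relaxed);
	/* ...then publish the head; release orders the slot write before it */
	atomic_store_explicit(&history_head, now_hi, memory_order_release);
}

static bool reader(unsigned long *snap)
{
	/* acquire pairs with the writer's release */
	unsigned long head = atomic_load_explicit(&history_head,
						  memory_order_acquire);

	*snap = atomic_load_explicit(&history[head % 4], memory_order_relaxed);
	/* if the head moved meanwhile, the slot may be recycled - retry */
	return head == atomic_load_explicit(&history_head,
					    memory_order_relaxed);
}

This mirrors the retry loop in dql_check_stall(): the reader only trusts
its scan if history_head did not move while it was reading the slots.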
Breno Leitao Feb. 13, 2024, 1:44 p.m. UTC | #2
On Tue, Feb 06, 2024 at 12:40:13PM +0100, Paolo Abeni wrote:
> On Fri, 2024-02-02 at 08:53 -0800, Breno Leitao wrote:
> > From: Jakub Kicinski <kuba@kernel.org>
> > 
> > softnet_data->time_squeeze is sometimes used as a proxy for
> > host overload or an indication of scheduling problems. In practice
> > this statistic is very noisy and has hard-to-grasp units -
> > e.g. is 10 squeezes a second to be expected, or high?
> > 
> > Delaying network (NAPI) processing leads to drops on NIC queues
> > but also RTT bloat, impacting pacing and CA decisions.
> > Stalls are a little hard to detect on the Rx side, because
> > there may simply have not been any packets received in a given
> > period of time. Packet timestamps help a little bit, but
> > again we don't know if packets are stale because we're
> > not keeping up or because someone (*cough* cgroups)
> > disabled IRQs for a long time.
> > 
> > We can, however, use Tx as a proxy for Rx stalls. Most drivers
> > use combined Rx+Tx NAPIs so if Tx gets starved so will Rx.
> > On the Tx side we know exactly when packets get queued,
> > and completed, so there is no uncertainty.
> > 
> > This patch adds stall checks to BQL. Why BQL? Because
> > it's a convenient place to add such checks, already
> > called by most drivers, and it has copious free space
> > in its structures (this patch adds no extra cache
> > references or dirtying to the fast path).
> > 
> > The algorithm takes one parameter - the max delay, AKA the
> > stall threshold - and increments a counter whenever NAPI gets
> > delayed for at least that amount of time. It also records the
> > length of the longest stall.
> > 
> > To be precise, every time NAPI has not polled for at least
> > stall_thrs, we check whether any Tx packets were queued
> > between the last NAPI run and now - stall_thrs/2.
> > 
> > Unlike the classic Tx watchdog, this mechanism does not
> > ignore stalls caused by Tx being disabled or loss of link.
> > I don't think such a check is worth the complexity; a
> > stall is a stall, whether due to host overload, flow
> > control, or link down - it doesn't matter much to the application.
> > 
> > We have been running this detector in production at Meta
> > for 2 years, with a threshold of 8 ms. It's the lowest
> > value at which false positives become rare. There's still
> > a constant stream of reported stalls (especially when the
> > ksoftirqd deferral patches are not reverted); those who like
> > their stall metrics to be 0 may prefer a higher value.
> > 
> > Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> > Signed-off-by: Breno Leitao <leitao@debian.org>
> > ---
> > Changelog:
> > 
> > v1:
> >   * https://lore.kernel.org/netdev/202306172057.jx7YhLiu-lkp@intel.com/T/
> > 
> > v2:
> >   * Fix the documentation file in patch 0001, since patch 0002 will
> >     touch it later.
> >   * Fix the kernel test robot issues, marking functions as static.
> >   * Use #include <linux/bitops.h> instead of <asm/bitops.h>.
> >   * Added some new comments, mainly around the barriers.
> >   * Format struct `netdev_queue_attribute` assignments to make
> >     checkpatch happy.
> >   * Updated and fixed the path in sysfs-class-net-queues
> >     documentation.
> > 
> > v3:
> >   * Sending patch 0002 against net-next.
> > 	- The first patch was accepted against 'net'
> > 
> > ---
> >  .../ABI/testing/sysfs-class-net-queues        | 23 +++++++
> >  include/linux/dynamic_queue_limits.h          | 35 +++++++++++
> >  include/trace/events/napi.h                   | 33 ++++++++++
> >  lib/dynamic_queue_limits.c                    | 58 +++++++++++++++++
> >  net/core/net-sysfs.c                          | 62 +++++++++++++++++++
> >  5 files changed, 211 insertions(+)
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-class-net-queues b/Documentation/ABI/testing/sysfs-class-net-queues
> > index 5bff64d256c2..45bab9b6e3ae 100644
> > --- a/Documentation/ABI/testing/sysfs-class-net-queues
> > +++ b/Documentation/ABI/testing/sysfs-class-net-queues
> > @@ -96,3 +96,26 @@ Description:
> >  		Indicates the absolute minimum limit of bytes allowed to be
> >  		queued on this network device transmit queue. Default value is
> >  		0.
> > +
> > +What:		/sys/class/net/<iface>/queues/tx-<queue>/byte_queue_limits/stall_thrs
> > +Date:		Jan 2024
> > +KernelVersion:	6.9
> > +Contact:	netdev@vger.kernel.org
> > +Description:
> > +		Tx completion stall detection threshold. 
> 
> Possibly worth mentioning it's in milliseconds
> 
> > Kernel will guarantee
> > +		to detect all stalls longer than this threshold but may also
> > +		detect stalls longer than half of the threshold.
> > +
> > +What:		/sys/class/net/<iface>/queues/tx-<queue>/byte_queue_limits/stall_cnt
> > +Date:		Jan 2024
> > +KernelVersion:	6.9
> > +Contact:	netdev@vger.kernel.org
> > +Description:
> > +		Number of detected Tx completion stalls.
> > +
> > +What:		/sys/class/net/<iface>/queues/tx-<queue>/byte_queue_limits/stall_max
> > +Date:		Jan 2024
> > +KernelVersion:	6.9
> > +Contact:	netdev@vger.kernel.org
> > +Description:
> > +		Longest detected Tx completion stall. Write 0 to clear.
> > diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
> > index 407c2f281b64..288e98fe85f0 100644
> > --- a/include/linux/dynamic_queue_limits.h
> > +++ b/include/linux/dynamic_queue_limits.h
> > @@ -38,14 +38,21 @@
> >  
> >  #ifdef __KERNEL__
> >  
> > +#include <linux/bitops.h>
> >  #include <asm/bug.h>
> >  
> > +#define DQL_HIST_LEN		4
> > +#define DQL_HIST_ENT(dql, idx)	((dql)->history[(idx) % DQL_HIST_LEN])
> > +
> >  struct dql {
> >  	/* Fields accessed in enqueue path (dql_queued) */
> >  	unsigned int	num_queued;		/* Total ever queued */
> >  	unsigned int	adj_limit;		/* limit + num_completed */
> >  	unsigned int	last_obj_cnt;		/* Count at last queuing */
> >  
> > +	unsigned long	history_head;
> > +	unsigned long	history[DQL_HIST_LEN];
> > +
> >  	/* Fields accessed only by completion path (dql_completed) */
> >  
> >  	unsigned int	limit ____cacheline_aligned_in_smp; /* Current limit */
> > @@ -62,6 +69,11 @@ struct dql {
> >  	unsigned int	max_limit;		/* Max limit */
> >  	unsigned int	min_limit;		/* Minimum limit */
> >  	unsigned int	slack_hold_time;	/* Time to measure slack */
> > +
> > +	unsigned char	stall_thrs;
> > +	unsigned char	stall_max;
> 
> Why not 'unsigned short'?

Yes, that seems fine to me.
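
For concreteness, the change being agreed to would presumably be (a
hypothetical v4 fragment, not posted code):

	unsigned short	stall_thrs;	/* Stall threshold (in jiffies) */
	unsigned short	stall_max;	/* Longest stall seen (in jiffies) */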

> 
> 
> > +	unsigned long	last_reap;
> > +	unsigned long	stall_cnt;
> >  };
> >  
> >  /* Set some static maximums */
> > @@ -74,6 +86,8 @@ struct dql {
> >   */
> >  static inline void dql_queued(struct dql *dql, unsigned int count)
> >  {
> > +	unsigned long map, now, now_hi, i;
> > +
> >  	BUG_ON(count > DQL_MAX_OBJECT);
> >  
> >  	dql->last_obj_cnt = count;
> > @@ -86,6 +100,27 @@ static inline void dql_queued(struct dql *dql, unsigned int count)
> >  	barrier();
> >  
> >  	dql->num_queued += count;
> > +
> > +	now = jiffies;
> > +	now_hi = now / BITS_PER_LONG;
> > +	if (unlikely(now_hi != dql->history_head)) {
> > +		/* About to reuse slots, clear them */
> > +		for (i = 0; i < DQL_HIST_LEN; i++) {
> > +			/* Multiplication masks high bits */
> > +			if (now_hi * BITS_PER_LONG ==
> > +			    (dql->history_head + i) * BITS_PER_LONG)
> > +				break;
> > +			DQL_HIST_ENT(dql, dql->history_head + i + 1) = 0;
> > +		}
> > +		/* pairs with smp_rmb() in dql_check_stall() */
> > +		smp_wmb();
> > +		WRITE_ONCE(dql->history_head, now_hi);
> > +	}
> > +
> > +	/* __set_bit() does not guarantee WRITE_ONCE() semantics */
> > +	map = DQL_HIST_ENT(dql, now_hi);
> > +	if (!(map & BIT_MASK(now)))
> > +		WRITE_ONCE(DQL_HIST_ENT(dql, now_hi), map | BIT_MASK(now));
> 
> Do you have a measure of the performance impact, if any? E.g. for udp
> flood/pktgen/xdp tput tests with extreme pkt rate?

I've looked at our Meta internal profiler, if that helps. Currently Meta has
this feature enabled on every machine.

The only function that shows up in the profiler is dql_completed(), which
consumes 0.00204% of all kernel cycles. Does that help?


> What about making all the above conditional on a non-zero stall_thrs,
> like the check part?

Makes sense, I will update.
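
A minimal sketch of that guard (hypothetical - the actual follow-up may
differ): since the history bookkeeping sits at the tail of dql_queued(),
it can simply bail out while the detector is disarmed:

	if (!READ_ONCE(dql->stall_thrs))
		return;

	now = jiffies;
	now_hi = now / BITS_PER_LONG;
	/* ... history_head / history[] update as in the hunk above ... */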

> >  }
> >  
> >  /* Returns how many objects can be queued, < 0 indicates over limit. */
> > diff --git a/include/trace/events/napi.h b/include/trace/events/napi.h
> > index 6678cf8b235b..272112ddaaa8 100644
> > --- a/include/trace/events/napi.h
> > +++ b/include/trace/events/napi.h
> > @@ -36,6 +36,39 @@ TRACE_EVENT(napi_poll,
> >  		  __entry->work, __entry->budget)
> >  );
> >  
> > +TRACE_EVENT(dql_stall_detected,
> > +
> > +	TP_PROTO(unsigned int thrs, unsigned int len,
> > +		 unsigned long last_reap, unsigned long hist_head,
> > +		 unsigned long now, unsigned long *hist),
> > +
> > +	TP_ARGS(thrs, len, last_reap, hist_head, now, hist),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(	unsigned int,		thrs)
> > +		__field(	unsigned int,		len)
> > +		__field(	unsigned long,		last_reap)
> > +		__field(	unsigned long,		hist_head)
> > +		__field(	unsigned long,		now)
> > +		__array(	unsigned long,		hist, 4)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->thrs = thrs;
> > +		__entry->len = len;
> > +		__entry->last_reap = last_reap;
> > +		__entry->hist_head = hist_head * BITS_PER_LONG;
> > +		__entry->now = now;
> > +		memcpy(__entry->hist, hist, sizeof(entry->hist));
> > +	),
> > +
> > +	TP_printk("thrs %u  len %u  last_reap %lu  hist_head %lu  now %lu  hist %016lx %016lx %016lx %016lx",
> > +		  __entry->thrs, __entry->len,
> > +		  __entry->last_reap, __entry->hist_head, __entry->now,
> > +		  __entry->hist[0], __entry->hist[1],
> > +		  __entry->hist[2], __entry->hist[3])
> > +);
> > +
> >  #undef NO_DEV
> >  
> >  #endif /* _TRACE_NAPI_H */
> > diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
> > index fde0aa244148..162d30ae542c 100644
> > --- a/lib/dynamic_queue_limits.c
> > +++ b/lib/dynamic_queue_limits.c
> > @@ -10,10 +10,61 @@
> >  #include <linux/dynamic_queue_limits.h>
> >  #include <linux/compiler.h>
> >  #include <linux/export.h>
> > +#include <trace/events/napi.h>
> >  
> >  #define POSDIFF(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0)
> >  #define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)
> >  
> > +static void dql_check_stall(struct dql *dql)
> > +{
> > +	unsigned long stall_thrs, now;
> > +
> > +	/* If we detect a stall see if anything was queued */
> > +	stall_thrs = READ_ONCE(dql->stall_thrs);
> > +	if (!stall_thrs)
> > +		return;
> > +
> > +	now = jiffies;
> > +	if (time_after_eq(now, dql->last_reap + stall_thrs)) {
> > +		unsigned long hist_head, t, start, end;
> > +
> > +		/* We are trying to detect a period of at least @stall_thrs
> > +		 * jiffies without any Tx completions, but during first half
> > +		 * of which some Tx was posted.
> > +		 */
> > +dqs_again:
> > +		hist_head = READ_ONCE(dql->history_head);
> > +		/* pairs with smp_wmb() in dql_queued() */
> > +		smp_rmb();
> > +		/* oldest sample since last reap */
> > +		start = (hist_head - DQL_HIST_LEN + 1) * BITS_PER_LONG;
> > +		if (time_before(start, dql->last_reap + 1))
> > +			start = dql->last_reap + 1;
> > +		/* newest sample we should have already seen a completion for */
> > +		end = hist_head * BITS_PER_LONG + (BITS_PER_LONG - 1);
> > +		if (time_before(now, end + stall_thrs / 2))
> > +			end = now - stall_thrs / 2;
> > +
> > +		for (t = start; time_before_eq(t, end); t++)
> > +			if (test_bit(t % (DQL_HIST_LEN * BITS_PER_LONG),
> > +				     dql->history))
> > +				break;
> > +		if (!time_before_eq(t, end))
> > +			goto no_stall;
> > +		if (hist_head != READ_ONCE(dql->history_head))
> > +			goto dqs_again;
> > +
> > +		dql->stall_cnt++;
> > +		dql->stall_max = max_t(unsigned int, dql->stall_max, now - t);
> > +
> > +		trace_dql_stall_detected(dql->stall_thrs, now - t,
> > +					 dql->last_reap, dql->history_head,
> > +					 now, dql->history);
> > +	}
> > +no_stall:
> > +	dql->last_reap = now;
> > +}
> > +
> >  /* Records completed count and recalculates the queue limit */
> >  void dql_completed(struct dql *dql, unsigned int count)
> >  {
> > @@ -110,6 +161,8 @@ void dql_completed(struct dql *dql, unsigned int count)
> >  	dql->prev_last_obj_cnt = dql->last_obj_cnt;
> >  	dql->num_completed = completed;
> >  	dql->prev_num_queued = num_queued;
> > +
> > +	dql_check_stall(dql);
> >  }
> >  EXPORT_SYMBOL(dql_completed);
> >  
> > @@ -125,6 +178,10 @@ void dql_reset(struct dql *dql)
> >  	dql->prev_ovlimit = 0;
> >  	dql->lowest_slack = UINT_MAX;
> >  	dql->slack_start_time = jiffies;
> > +
> > +	dql->last_reap = jiffies;
> > +	dql->history_head = jiffies / BITS_PER_LONG;
> > +	memset(dql->history, 0, sizeof(dql->history));
> >  }
> >  EXPORT_SYMBOL(dql_reset);
> >  
> > @@ -133,6 +190,7 @@ void dql_init(struct dql *dql, unsigned int hold_time)
> >  	dql->max_limit = DQL_MAX_LIMIT;
> >  	dql->min_limit = 0;
> >  	dql->slack_hold_time = hold_time;
> > +	dql->stall_thrs = 0;
> >  	dql_reset(dql);
> >  }
> >  EXPORT_SYMBOL(dql_init);
> > diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> > index a09d507c5b03..94a622b0bb55 100644
> > --- a/net/core/net-sysfs.c
> > +++ b/net/core/net-sysfs.c
> > @@ -1409,6 +1409,65 @@ static struct netdev_queue_attribute bql_hold_time_attribute __ro_after_init
> >  	= __ATTR(hold_time, 0644,
> >  		 bql_show_hold_time, bql_set_hold_time);
> >  
> > +static ssize_t bql_show_stall_thrs(struct netdev_queue *queue, char *buf)
> > +{
> > +	struct dql *dql = &queue->dql;
> > +
> > +	return sprintf(buf, "%u\n", jiffies_to_msecs(dql->stall_thrs));
> > +}
> > +
> > +static ssize_t bql_set_stall_thrs(struct netdev_queue *queue,
> > +				  const char *buf, size_t len)
> > +{
> > +	struct dql *dql = &queue->dql;
> > +	unsigned int value;
> > +	int err;
> > +
> > +	err = kstrtouint(buf, 10, &value);
> > +	if (err < 0)
> > +		return err;
> > +
> > +	value = msecs_to_jiffies(value);
> > +	if (value && (value < 4 || value > 4 / 2 * BITS_PER_LONG))
> 
> I admit I'm more than a bit lost with the constant usage... why 4/2 here?

4 in this case is DQL_HIST_LEN, so it needs to be updated.

> Perhaps some more comments describing the implemented logic would be
> helpful.

Agreed, let me work on it.
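
Spelled out with the named constant, the updated check might read as
below (a sketch based on this exchange, not the final code; the cap at
half the history span is one reading of the look-back logic):

	value = msecs_to_jiffies(value);
	/* History records DQL_HIST_LEN * BITS_PER_LONG jiffies in total;
	 * cap the threshold at half of that so the stall look-back window
	 * always falls inside recorded history.
	 */
	if (value && (value < 4 || value > DQL_HIST_LEN / 2 * BITS_PER_LONG))
		return -ERANGE;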
Eric Dumazet Feb. 13, 2024, 1:57 p.m. UTC | #3
On Fri, Feb 2, 2024 at 5:55 PM Breno Leitao <leitao@debian.org> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
>
> softnet_data->time_squeeze is sometimes used as a proxy for
> host overload or an indication of scheduling problems. In practice
> this statistic is very noisy and has hard-to-grasp units -
> e.g. is 10 squeezes a second to be expected, or high?
>
> Delaying network (NAPI) processing leads to drops on NIC queues
> but also RTT bloat, impacting pacing and CA decisions.
> Stalls are a little hard to detect on the Rx side, because
> there may simply have not been any packets received in a given
> period of time. Packet timestamps help a little bit, but
> again we don't know if packets are stale because we're
> not keeping up or because someone (*cough* cgroups)
> disabled IRQs for a long time.

Please note that adding other sysfs entries is expensive for workloads
creating/deleting netdev and netns often.

I _think_ we should find a way for not creating
/sys/class/net/<interface>/queues/tx-{Q}/byte_queue_limits  directory
and files
for non BQL enabled devices (like loopback !)
Jakub Kicinski Feb. 13, 2024, 6:04 p.m. UTC | #4
On Tue, 13 Feb 2024 14:57:49 +0100 Eric Dumazet wrote:
> Please note that adding other sysfs entries is expensive for workloads
> creating/deleting netdev and netns often.
> 
> I _think_ we should find a way for not creating
> /sys/class/net/<interface>/queues/tx-{Q}/byte_queue_limits  directory
> and files
> for non BQL enabled devices (like loopback !)

We should try, see if anyone screams. We could use IFF_NO_QUEUE, and
NETIF_F_LLTX as a proxy for "device doesn't have a real queue so BQL 
would be pointless"? Obviously better to annotate the drivers which
do have BQL support, but there's >50 of them on a quick count..
Breno Leitao Feb. 14, 2024, 2:45 p.m. UTC | #5
On Tue, Feb 13, 2024 at 10:04:57AM -0800, Jakub Kicinski wrote:
> On Tue, 13 Feb 2024 14:57:49 +0100 Eric Dumazet wrote:
> > Please note that adding other sysfs entries is expensive for workloads
> > creating/deleting netdev and netns often.
> > 
> > I _think_ we should find a way for not creating
> > /sys/class/net/<interface>/queues/tx-{Q}/byte_queue_limits  directory
> > and files
> > for non BQL enabled devices (like loopback !)
> 
> We should try, see if anyone screams. We could use IFF_NO_QUEUE, and
> NETIF_F_LLTX as a proxy for "device doesn't have a real queue so BQL 
> would be pointless"? Obviously better to annotate the drivers which
> do have BQL support, but there's >50 of them on a quick count..

Let me make sure I understand the suggestion above. We want to disable
BQL completely for devices that have dev->features & NETIF_F_LLTX or
dev->priv_flags & IFF_NO_QUEUE, right?

Maybe we can add a ->enabled field in struct dql, and set it according
to the features above. Then we can create the sysfs entries and process
the dql operations based on that field. This should also avoid some
unnecessary calls if we are not displaying sysfs.

Here is a very simple PoC to represent what I had in mind. Am I going in
the right direction?

diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
index 407c2f281b64..a9d17597ea80 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_limits.h
@@ -62,6 +62,7 @@ struct dql {
 	unsigned int	max_limit;		/* Max limit */
 	unsigned int	min_limit;		/* Minimum limit */
 	unsigned int	slack_hold_time;	/* Time to measure slack */
+	bool		enabled;		/* Queue is active */
 };
 
 /* Set some static maximums */
@@ -101,7 +102,7 @@ void dql_completed(struct dql *dql, unsigned int count);
 void dql_reset(struct dql *dql);
 
 /* Initialize dql state */
-void dql_init(struct dql *dql, unsigned int hold_time);
+void dql_init(struct net_device *dev, struct dql *dql, unsigned int hold_time);
 
 #endif /* _KERNEL_ */
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ef7bfbb98497..5c69bbf4267d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3541,6 +3541,9 @@ static inline void netdev_tx_sent_queue(struct netdev_queue *dev_queue,
 					unsigned int bytes)
 {
 #ifdef CONFIG_BQL
+	if (!dev_queue->dql.enabled)
+		return;
+
 	dql_queued(&dev_queue->dql, bytes);
 
 	if (likely(dql_avail(&dev_queue->dql) >= 0))
@@ -3573,7 +3576,8 @@ static inline bool __netdev_tx_sent_queue(struct netdev_queue *dev_queue,
 {
 	if (xmit_more) {
 #ifdef CONFIG_BQL
-		dql_queued(&dev_queue->dql, bytes);
+		if (dev_queue->dql.enabled)
+			dql_queued(&dev_queue->dql, bytes);
 #endif
 		return netif_tx_queue_stopped(dev_queue);
 	}
@@ -3617,7 +3621,7 @@ static inline void netdev_tx_completed_queue(struct netdev_queue *dev_queue,
 					     unsigned int pkts, unsigned int bytes)
 {
 #ifdef CONFIG_BQL
-	if (unlikely(!bytes))
+	if (unlikely(!bytes) || !dev_queue->dql.enabled)
 		return;
 
 	dql_completed(&dev_queue->dql, bytes);
@@ -3656,6 +3660,9 @@ static inline void netdev_completed_queue(struct net_device *dev,
 static inline void netdev_tx_reset_queue(struct netdev_queue *q)
 {
 #ifdef CONFIG_BQL
+	if (!q->dql.enabled)
+		return;
+
 	clear_bit(__QUEUE_STATE_STACK_XOFF, &q->state);
 	dql_reset(&q->dql);
 #endif
diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
index fde0aa244148..0a0a51f06c3b 100644
--- a/lib/dynamic_queue_limits.c
+++ b/lib/dynamic_queue_limits.c
@@ -10,6 +10,7 @@
 #include <linux/dynamic_queue_limits.h>
 #include <linux/compiler.h>
 #include <linux/export.h>
+#include <linux/netdevice.h>
 
 #define POSDIFF(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0)
 #define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)
@@ -128,11 +129,21 @@ void dql_reset(struct dql *dql)
 }
 EXPORT_SYMBOL(dql_reset);
 
-void dql_init(struct dql *dql, unsigned int hold_time)
+static bool netdev_dql_supported(struct net_device *dev)
+{
+	if (dev->features & NETIF_F_LLTX ||
+	    dev->priv_flags & IFF_NO_QUEUE)
+		return false;
+
+	return true;
+}
+
+void dql_init(struct net_device *dev, struct dql *dql, unsigned int hold_time)
 {
 	dql->max_limit = DQL_MAX_LIMIT;
 	dql->min_limit = 0;
 	dql->slack_hold_time = hold_time;
 	dql_reset(dql);
+	dql->enabled = netdev_dql_supported(dev);
 }
 EXPORT_SYMBOL(dql_init);
diff --git a/net/core/dev.c b/net/core/dev.c
index 9bb792cecc16..76aa70ee2c87 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10052,7 +10052,7 @@ static void netdev_init_one_queue(struct net_device *dev,
 	netdev_queue_numa_node_write(queue, NUMA_NO_NODE);
 	queue->dev = dev;
 #ifdef CONFIG_BQL
-	dql_init(&queue->dql, HZ);
+	dql_init(dev, &queue->dql, HZ);
 #endif
 }
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index a09d507c5b03..144ce4bb57bc 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1709,9 +1709,11 @@ static int netdev_queue_add_kobject(struct net_device *dev, int index)
 		goto err;
 
 #ifdef CONFIG_BQL
-	error = sysfs_create_group(kobj, &dql_group);
-	if (error)
-		goto err;
+	if (queue->dql.enabled) {
+		error = sysfs_create_group(kobj, &dql_group);
+		if (error)
+			goto err;
+	}
 #endif
 
 	kobject_uevent(kobj, KOBJ_ADD);
Eric Dumazet Feb. 14, 2024, 3:41 p.m. UTC | #6
On Wed, Feb 14, 2024 at 3:45 PM Breno Leitao <leitao@debian.org> wrote:
>
> On Tue, Feb 13, 2024 at 10:04:57AM -0800, Jakub Kicinski wrote:
> > On Tue, 13 Feb 2024 14:57:49 +0100 Eric Dumazet wrote:
> > > Please note that adding other sysfs entries is expensive for workloads
> > > creating/deleting netdev and netns often.
> > >
> > > I _think_ we should find a way for not creating
> > > /sys/class/net/<interface>/queues/tx-{Q}/byte_queue_limits  directory
> > > and files
> > > for non BQL enabled devices (like loopback !)
> >
> > We should try, see if anyone screams. We could use IFF_NO_QUEUE, and
> > NETIF_F_LLTX as a proxy for "device doesn't have a real queue so BQL
> > would be pointless"? Obviously better to annotate the drivers which
> > do have BQL support, but there's >50 of them on a quick count..
>
> Let me make sure I understand the suggestion above. We want to disable
> BQL completely for devices that have dev->features & NETIF_F_LLTX or
> dev->priv_flags & IFF_NO_QUEUE, right?
>
> Maybe we can add a ->enabled field in struct dql, and set it according
> to the features above. Then we can create the sysfs entries and process
> the dql operations based on that field. This should also avoid some
> unnecessary calls if we are not displaying sysfs.
>
> Here is a very simple PoC to represent what I had in mind. Am I going in
> the right direction?

No, this was really about sysfs entries (aka dql_group)

Partial patch would be:

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index a09d507c5b03d24a829bf7af0b7cf1e6a0bdb65a..094e3b2d78cca40d810b2fa3bd4393d22b30e6ad 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1709,9 +1709,11 @@ static int netdev_queue_add_kobject(struct net_device *dev, int index)
                goto err;

 #ifdef CONFIG_BQL
-       error = sysfs_create_group(kobj, &dql_group);
-       if (error)
-               goto err;
+       if (netdev_uses_bql(dev)) {
+               error = sysfs_create_group(kobj, &dql_group);
+               if (error)
+                       goto err;
+       }
 #endif

        kobject_uevent(kobj, KOBJ_ADD);
@@ -1734,7 +1736,8 @@ static int tx_queue_change_owner(struct net_device *ndev, int index,
                return error;

 #ifdef CONFIG_BQL
-       error = sysfs_group_change_owner(kobj, &dql_group, kuid, kgid);
+       if (netdev_uses_bql(ndev))
+               error = sysfs_group_change_owner(kobj, &dql_group, kuid, kgid);
 #endif
        return error;
 }
@@ -1768,7 +1771,8 @@ netdev_queue_update_kobjects(struct net_device *dev, int old_num, int new_num)
                if (!refcount_read(&dev_net(dev)->ns.count))
                        queue->kobj.uevent_suppress = 1;
 #ifdef CONFIG_BQL
-               sysfs_remove_group(&queue->kobj, &dql_group);
+               if (netdev_uses_bql(dev))
+                       sysfs_remove_group(&queue->kobj, &dql_group);
 #endif
                kobject_put(&queue->kobj);
        }
Breno Leitao Feb. 14, 2024, 4:49 p.m. UTC | #7
On Wed, Feb 14, 2024 at 04:41:36PM +0100, Eric Dumazet wrote:
> On Wed, Feb 14, 2024 at 3:45 PM Breno Leitao <leitao@debian.org> wrote:
> >
> > On Tue, Feb 13, 2024 at 10:04:57AM -0800, Jakub Kicinski wrote:
> > > On Tue, 13 Feb 2024 14:57:49 +0100 Eric Dumazet wrote:
> > > > Please note that adding other sysfs entries is expensive for workloads
> > > > creating/deleting netdev and netns often.
> > > >
> > > > I _think_ we should find a way for not creating
> > > > /sys/class/net/<interface>/queues/tx-{Q}/byte_queue_limits  directory
> > > > and files
> > > > for non BQL enabled devices (like loopback !)
> > >
> > > We should try, see if anyone screams. We could use IFF_NO_QUEUE, and
> > > NETIF_F_LLTX as a proxy for "device doesn't have a real queue so BQL
> > > would be pointless"? Obviously better to annotate the drivers which
> > > do have BQL support, but there's >50 of them on a quick count..
> >
> > Let me make sure I understand the suggestion above. We want to disable
> > BQL completely for devices that have dev->features & NETIF_F_LLTX or
> > dev->priv_flags & IFF_NO_QUEUE, right?
> >
> > Maybe we can add a ->enabled field in struct dql, and set it according
> > to the features above. Then we can create the sysfs entries and process
> > the dql operations based on that field. This should also avoid some
> > unnecessary calls if we are not displaying sysfs.
> >
> > Here is a very simple PoC to represent what I had in mind. Am I going in
> > the right direction?
> 
> No, this was really about sysfs entries (aka dql_group)
> 
> Partial patch would be:

That is simpler than what I imagined. Thanks!

> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index a09d507c5b03d24a829bf7af0b7cf1e6a0bdb65a..094e3b2d78cca40d810b2fa3bd4393d22b30e6ad
> 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -1709,9 +1709,11 @@ static int netdev_queue_add_kobject(struct
> net_device *dev, int index)
>                 goto err;
> 
>  #ifdef CONFIG_BQL
> -       error = sysfs_create_group(kobj, &dql_group);
> -       if (error)
> -               goto err;
> +       if (netdev_uses_bql(dev)) {

for netdev_uses_bql(), would it be similar to what I proposed in the
previous message? Let me copy it here.

	static bool netdev_uses_bql(struct net_device *dev)
	{
	       if (dev->features & NETIF_F_LLTX ||
		   dev->priv_flags & IFF_NO_QUEUE)
		       return false;

	       return true;
	}

Thanks
Eric Dumazet Feb. 14, 2024, 4:58 p.m. UTC | #8
On Wed, Feb 14, 2024 at 5:49 PM Breno Leitao <leitao@debian.org> wrote:
>
> On Wed, Feb 14, 2024 at 04:41:36PM +0100, Eric Dumazet wrote:
> > On Wed, Feb 14, 2024 at 3:45 PM Breno Leitao <leitao@debian.org> wrote:
> > >
> > > On Tue, Feb 13, 2024 at 10:04:57AM -0800, Jakub Kicinski wrote:
> > > > On Tue, 13 Feb 2024 14:57:49 +0100 Eric Dumazet wrote:
> > > > > Please note that adding other sysfs entries is expensive for workloads
> > > > > creating/deleting netdev and netns often.
> > > > >
> > > > > I _think_ we should find a way for not creating
> > > > > /sys/class/net/<interface>/queues/tx-{Q}/byte_queue_limits  directory
> > > > > and files
> > > > > for non BQL enabled devices (like loopback !)
> > > >
> > > > We should try, see if anyone screams. We could use IFF_NO_QUEUE, and
> > > > NETIF_F_LLTX as a proxy for "device doesn't have a real queue so BQL
> > > > would be pointless"? Obviously better to annotate the drivers which
> > > > do have BQL support, but there's >50 of them on a quick count..
> > >
> > > Let me make sure I understand the suggestion above. We want to disable
> > > BQL completely for devices that have dev->features & NETIF_F_LLTX or
> > > dev->priv_flags & IFF_NO_QUEUE, right?
> > >
> > > Maybe we can add a ->enabled field in struct dql, and set it according
> > > to the features above. Then we can create the sysfs entries and process
> > > the dql operations based on that field. This should also avoid some
> > > unnecessary calls if we are not displaying sysfs.
> > >
> > > Here is a very simple PoC to represent what I had in mind. Am I going in
> > > the right direction?
> >
> > No, this was really about sysfs entries (aka dql_group)
> >
> > Partial patch would be:
>
> That is simpler than what I imagined. Thanks!
>

>
> for netdev_uses_bql(), would it be similar to what I proposed in the
> previous message? Let me copy it here.
>
>         static bool netdev_uses_bql(struct net_device *dev)
>         {
>                if (dev->features & NETIF_F_LLTX ||
>                    dev->priv_flags & IFF_NO_QUEUE)
>                        return false;
>
>                return true;
>         }

I think this should be fine, yes.
Breno Leitao Feb. 14, 2024, 5:31 p.m. UTC | #9
On Wed, Feb 14, 2024 at 05:58:37PM +0100, Eric Dumazet wrote:
> On Wed, Feb 14, 2024 at 5:49 PM Breno Leitao <leitao@debian.org> wrote:
> >
> > On Wed, Feb 14, 2024 at 04:41:36PM +0100, Eric Dumazet wrote:
> > > On Wed, Feb 14, 2024 at 3:45 PM Breno Leitao <leitao@debian.org> wrote:
> > > >
> > > > On Tue, Feb 13, 2024 at 10:04:57AM -0800, Jakub Kicinski wrote:
> > > > > On Tue, 13 Feb 2024 14:57:49 +0100 Eric Dumazet wrote:
> > > > > > Please note that adding other sysfs entries is expensive for workloads
> > > > > > creating/deleting netdev and netns often.
> > > > > >
> > > > > > I _think_ we should find a way for not creating
> > > > > > /sys/class/net/<interface>/queues/tx-{Q}/byte_queue_limits  directory
> > > > > > and files
> > > > > > for non BQL enabled devices (like loopback !)
> > > > >
> > > > > We should try, see if anyone screams. We could use IFF_NO_QUEUE, and
> > > > > NETIF_F_LLTX as a proxy for "device doesn't have a real queue so BQL
> > > > > would be pointless"? Obviously better to annotate the drivers which
> > > > > do have BQL support, but there's >50 of them on a quick count..
> > > >
> > > > Let me make sure I understand the suggestion above. We want to disable
> > > > BQL completely for devices that have dev->features & NETIF_F_LLTX or
> > > > dev->priv_flags & IFF_NO_QUEUE, right?
> > > >
> > > > Maybe we can add a ->enabled field in struct dql, and set it according
> > > > to the features above. Then we can create the sysfs entries and process
> > > > the dql operations based on that field. This should also avoid some
> > > > unnecessary calls if we are not displaying sysfs.
> > > >
> > > > Here is a very simple PoC to represent what I had in mind. Am I going in
> > > > the right direction?
> > >
> > > No, this was really about sysfs entries (aka dql_group)
> > >
> > > Partial patch would be:
> >
> > That is simpler than what I imagined. Thanks!
> >
> 
> >
> > for netdev_uses_bql(), would it be similar to what I proposed in the
> > previous message? Let me copy it here.
> >
> >         static bool netdev_uses_bql(struct net_device *dev)
> >         {
> >                if (dev->features & NETIF_F_LLTX ||
> >                    dev->priv_flags & IFF_NO_QUEUE)
> >                        return false;
> >
> >                return true;
> >         }
> 
> I think this should be fine, yes.

Awesome, thanks.

I am planning to send this separately from the "net: dqs: add NIC stall
detector based on BQL" patch since there isn't really a dependency here.

Patch

diff --git a/Documentation/ABI/testing/sysfs-class-net-queues b/Documentation/ABI/testing/sysfs-class-net-queues
index 5bff64d256c2..45bab9b6e3ae 100644
--- a/Documentation/ABI/testing/sysfs-class-net-queues
+++ b/Documentation/ABI/testing/sysfs-class-net-queues
@@ -96,3 +96,26 @@  Description:
 		Indicates the absolute minimum limit of bytes allowed to be
 		queued on this network device transmit queue. Default value is
 		0.
+
+What:		/sys/class/net/<iface>/queues/tx-<queue>/byte_queue_limits/stall_thrs
+Date:		Jan 2024
+KernelVersion:	6.9
+Contact:	netdev@vger.kernel.org
+Description:
+		Tx completion stall detection threshold. Kernel will guarantee
+		to detect all stalls longer than this threshold but may also
+		detect stalls longer than half of the threshold.
+
+What:		/sys/class/net/<iface>/queues/tx-<queue>/byte_queue_limits/stall_cnt
+Date:		Jan 2024
+KernelVersion:	6.9
+Contact:	netdev@vger.kernel.org
+Description:
+		Number of detected Tx completion stalls.
+
+What:		/sys/class/net/<iface>/queues/tx-<queue>/byte_queue_limits/stall_max
+Date:		Jan 2024
+KernelVersion:	6.9
+Contact:	netdev@vger.kernel.org
+Description:
+		Longest detected Tx completion stall. Write 0 to clear.
diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
index 407c2f281b64..288e98fe85f0 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_limits.h
@@ -38,14 +38,21 @@ 
 
 #ifdef __KERNEL__
 
+#include <linux/bitops.h>
 #include <asm/bug.h>
 
+#define DQL_HIST_LEN		4
+#define DQL_HIST_ENT(dql, idx)	((dql)->history[(idx) % DQL_HIST_LEN])
+
 struct dql {
 	/* Fields accessed in enqueue path (dql_queued) */
 	unsigned int	num_queued;		/* Total ever queued */
 	unsigned int	adj_limit;		/* limit + num_completed */
 	unsigned int	last_obj_cnt;		/* Count at last queuing */
 
+	unsigned long	history_head;
+	unsigned long	history[DQL_HIST_LEN];
+
 	/* Fields accessed only by completion path (dql_completed) */
 
 	unsigned int	limit ____cacheline_aligned_in_smp; /* Current limit */
@@ -62,6 +69,11 @@  struct dql {
 	unsigned int	max_limit;		/* Max limit */
 	unsigned int	min_limit;		/* Minimum limit */
 	unsigned int	slack_hold_time;	/* Time to measure slack */
+
+	unsigned char	stall_thrs;
+	unsigned char	stall_max;
+	unsigned long	last_reap;
+	unsigned long	stall_cnt;
 };
 
 /* Set some static maximums */
@@ -74,6 +86,8 @@  struct dql {
  */
 static inline void dql_queued(struct dql *dql, unsigned int count)
 {
+	unsigned long map, now, now_hi, i;
+
 	BUG_ON(count > DQL_MAX_OBJECT);
 
 	dql->last_obj_cnt = count;
@@ -86,6 +100,27 @@  static inline void dql_queued(struct dql *dql, unsigned int count)
 	barrier();
 
 	dql->num_queued += count;
+
+	now = jiffies;
+	now_hi = now / BITS_PER_LONG;
+	if (unlikely(now_hi != dql->history_head)) {
+		/* About to reuse slots, clear them */
+		for (i = 0; i < DQL_HIST_LEN; i++) {
+			/* Multiplication masks high bits */
+			if (now_hi * BITS_PER_LONG ==
+			    (dql->history_head + i) * BITS_PER_LONG)
+				break;
+			DQL_HIST_ENT(dql, dql->history_head + i + 1) = 0;
+		}
+		/* pairs with smp_rmb() in dql_check_stall() */
+		smp_wmb();
+		WRITE_ONCE(dql->history_head, now_hi);
+	}
+
+	/* __set_bit() does not guarantee WRITE_ONCE() semantics */
+	map = DQL_HIST_ENT(dql, now_hi);
+	if (!(map & BIT_MASK(now)))
+		WRITE_ONCE(DQL_HIST_ENT(dql, now_hi), map | BIT_MASK(now));
 }
 
 /* Returns how many objects can be queued, < 0 indicates over limit. */
diff --git a/include/trace/events/napi.h b/include/trace/events/napi.h
index 6678cf8b235b..272112ddaaa8 100644
--- a/include/trace/events/napi.h
+++ b/include/trace/events/napi.h
@@ -36,6 +36,39 @@  TRACE_EVENT(napi_poll,
 		  __entry->work, __entry->budget)
 );
 
+TRACE_EVENT(dql_stall_detected,
+
+	TP_PROTO(unsigned int thrs, unsigned int len,
+		 unsigned long last_reap, unsigned long hist_head,
+		 unsigned long now, unsigned long *hist),
+
+	TP_ARGS(thrs, len, last_reap, hist_head, now, hist),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,		thrs)
+		__field(	unsigned int,		len)
+		__field(	unsigned long,		last_reap)
+		__field(	unsigned long,		hist_head)
+		__field(	unsigned long,		now)
+		__array(	unsigned long,		hist, 4)
+	),
+
+	TP_fast_assign(
+		__entry->thrs = thrs;
+		__entry->len = len;
+		__entry->last_reap = last_reap;
+		__entry->hist_head = hist_head * BITS_PER_LONG;
+		__entry->now = now;
+		memcpy(__entry->hist, hist, sizeof(entry->hist));
+	),
+
+	TP_printk("thrs %u  len %u  last_reap %lu  hist_head %lu  now %lu  hist %016lx %016lx %016lx %016lx",
+		  __entry->thrs, __entry->len,
+		  __entry->last_reap, __entry->hist_head, __entry->now,
+		  __entry->hist[0], __entry->hist[1],
+		  __entry->hist[2], __entry->hist[3])
+);
+
 #undef NO_DEV
 
 #endif /* _TRACE_NAPI_H */
diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
index fde0aa244148..162d30ae542c 100644
--- a/lib/dynamic_queue_limits.c
+++ b/lib/dynamic_queue_limits.c
@@ -10,10 +10,61 @@ 
 #include <linux/dynamic_queue_limits.h>
 #include <linux/compiler.h>
 #include <linux/export.h>
+#include <trace/events/napi.h>
 
 #define POSDIFF(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0)
 #define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)
 
+static void dql_check_stall(struct dql *dql)
+{
+	unsigned long stall_thrs, now;
+
+	/* If we detect a stall see if anything was queued */
+	stall_thrs = READ_ONCE(dql->stall_thrs);
+	if (!stall_thrs)
+		return;
+
+	now = jiffies;
+	if (time_after_eq(now, dql->last_reap + stall_thrs)) {
+		unsigned long hist_head, t, start, end;
+
+		/* We are trying to detect a period of at least @stall_thrs
+		 * jiffies without any Tx completions, but during first half
+		 * of which some Tx was posted.
+		 */
+dqs_again:
+		hist_head = READ_ONCE(dql->history_head);
+		/* pairs with smp_wmb() in dql_queued() */
+		smp_rmb();
+		/* oldest sample since last reap */
+		start = (hist_head - DQL_HIST_LEN + 1) * BITS_PER_LONG;
+		if (time_before(start, dql->last_reap + 1))
+			start = dql->last_reap + 1;
+		/* newest sample we should have already seen a completion for */
+		end = hist_head * BITS_PER_LONG + (BITS_PER_LONG - 1);
+		if (time_before(now, end + stall_thrs / 2))
+			end = now - stall_thrs / 2;
+
+		for (t = start; time_before_eq(t, end); t++)
+			if (test_bit(t % (DQL_HIST_LEN * BITS_PER_LONG),
+				     dql->history))
+				break;
+		if (!time_before_eq(t, end))
+			goto no_stall;
+		if (hist_head != READ_ONCE(dql->history_head))
+			goto dqs_again;
+
+		dql->stall_cnt++;
+		dql->stall_max = max_t(unsigned int, dql->stall_max, now - t);
+
+		trace_dql_stall_detected(dql->stall_thrs, now - t,
+					 dql->last_reap, dql->history_head,
+					 now, dql->history);
+	}
+no_stall:
+	dql->last_reap = now;
+}
+
 /* Records completed count and recalculates the queue limit */
 void dql_completed(struct dql *dql, unsigned int count)
 {
@@ -110,6 +161,8 @@  void dql_completed(struct dql *dql, unsigned int count)
 	dql->prev_last_obj_cnt = dql->last_obj_cnt;
 	dql->num_completed = completed;
 	dql->prev_num_queued = num_queued;
+
+	dql_check_stall(dql);
 }
 EXPORT_SYMBOL(dql_completed);
 
@@ -125,6 +178,10 @@  void dql_reset(struct dql *dql)
 	dql->prev_ovlimit = 0;
 	dql->lowest_slack = UINT_MAX;
 	dql->slack_start_time = jiffies;
+
+	dql->last_reap = jiffies;
+	dql->history_head = jiffies / BITS_PER_LONG;
+	memset(dql->history, 0, sizeof(dql->history));
 }
 EXPORT_SYMBOL(dql_reset);
 
@@ -133,6 +190,7 @@  void dql_init(struct dql *dql, unsigned int hold_time)
 	dql->max_limit = DQL_MAX_LIMIT;
 	dql->min_limit = 0;
 	dql->slack_hold_time = hold_time;
+	dql->stall_thrs = 0;
 	dql_reset(dql);
 }
 EXPORT_SYMBOL(dql_init);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index a09d507c5b03..94a622b0bb55 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1409,6 +1409,65 @@  static struct netdev_queue_attribute bql_hold_time_attribute __ro_after_init
 	= __ATTR(hold_time, 0644,
 		 bql_show_hold_time, bql_set_hold_time);
 
+static ssize_t bql_show_stall_thrs(struct netdev_queue *queue, char *buf)
+{
+	struct dql *dql = &queue->dql;
+
+	return sprintf(buf, "%u\n", jiffies_to_msecs(dql->stall_thrs));
+}
+
+static ssize_t bql_set_stall_thrs(struct netdev_queue *queue,
+				  const char *buf, size_t len)
+{
+	struct dql *dql = &queue->dql;
+	unsigned int value;
+	int err;
+
+	err = kstrtouint(buf, 10, &value);
+	if (err < 0)
+		return err;
+
+	value = msecs_to_jiffies(value);
+	if (value && (value < 4 || value > 4 / 2 * BITS_PER_LONG))
+		return -ERANGE;
+
+	if (!dql->stall_thrs && value)
+		dql->last_reap = jiffies;
+	/* Force last_reap to be live */
+	smp_wmb();
+	dql->stall_thrs = value;
+
+	return len;
+}
+
+static struct netdev_queue_attribute bql_stall_thrs_attribute __ro_after_init =
+	__ATTR(stall_thrs, 0644, bql_show_stall_thrs, bql_set_stall_thrs);
+
+static ssize_t bql_show_stall_max(struct netdev_queue *queue, char *buf)
+{
+	return sprintf(buf, "%u\n", READ_ONCE(queue->dql.stall_max));
+}
+
+static ssize_t bql_set_stall_max(struct netdev_queue *queue,
+				 const char *buf, size_t len)
+{
+	WRITE_ONCE(queue->dql.stall_max, 0);
+	return len;
+}
+
+static struct netdev_queue_attribute bql_stall_max_attribute __ro_after_init =
+	__ATTR(stall_max, 0644, bql_show_stall_max, bql_set_stall_max);
+
+static ssize_t bql_show_stall_cnt(struct netdev_queue *queue, char *buf)
+{
+	struct dql *dql = &queue->dql;
+
+	return sprintf(buf, "%lu\n", dql->stall_cnt);
+}
+
+static struct netdev_queue_attribute bql_stall_cnt_attribute __ro_after_init =
+	__ATTR(stall_cnt, 0444, bql_show_stall_cnt, NULL);
+
 static ssize_t bql_show_inflight(struct netdev_queue *queue,
 				 char *buf)
 {
@@ -1447,6 +1506,9 @@  static struct attribute *dql_attrs[] __ro_after_init = {
 	&bql_limit_min_attribute.attr,
 	&bql_hold_time_attribute.attr,
 	&bql_inflight_attribute.attr,
+	&bql_stall_thrs_attribute.attr,
+	&bql_stall_cnt_attribute.attr,
+	&bql_stall_max_attribute.attr,
 	NULL
 };
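
The counters are plain text files, so checking them from user space is
trivial; a minimal C reader (illustrative - the interface name and queue
index, eth0 and tx-0, are placeholders):

#include <stdio.h>

int main(void)
{
	char buf[64];
	FILE *f = fopen("/sys/class/net/eth0/queues/tx-0/"
			"byte_queue_limits/stall_cnt", "r");

	if (!f)
		return 1;
	if (fgets(buf, sizeof(buf), f))
		printf("stall_cnt: %s", buf);	/* number of detected stalls */
	fclose(f);
	return 0;
}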