diff mbox series

[RFC] net/sched: adjust device watchdog timer to detect stopped queue at right time

Message ID 20240430140010.5005-1-praveen.kannoju@oracle.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [RFC] net/sched: adjust device watchdog timer to detect stopped queue at right time | expand

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 934 this patch: 934
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 3 maintainers not CCed: pabeni@redhat.com kuba@kernel.org edumazet@google.com
netdev/build_clang success Errors and warnings before: 938 this patch: 938
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 945 this patch: 945
netdev/checkpatch warning WARNING: line length of 85 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 91 exceeds 80 columns WARNING: line length of 93 exceeds 80 columns WARNING: line length of 96 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Praveen Kumar Kannoju April 30, 2024, 2 p.m. UTC
Applications are sensitive to long network latency, particularly
heartbeat monitoring ones. Longer the tx timeout recovery higher the
risk with such applications on a production machines. This patch
remedies, yet honoring device set tx timeout.

Modify watchdog next timeout to be shorter than the device specified.
Compute the next timeout be equal to device watchdog timeout less the
how long ago queue stop had been done. At next watchdog timeout tx
timeout handler is called into if still in stopped state. Either called
or not called, restore the watchdog timeout back to device specified.

For example, mellanox driver with 15 sec watchdog timeout on its
interfaces will be called for handling timeout at random times as shown
below:

	mlx5_core 0000:af:00.0 enp175s0f0: TX timeout on queue: 40, SQ: 0x13106, CQ: 0x5b5, SQ Cons: 0x1 SQ Prod: 0x1, usecs since last trans: 28559000
	mlx5_core 0000:af:00.0 enp175s0f0: TX timeout on queue: 59, SQ: 0x132a4, CQ: 0x4f3, SQ Cons: 0x1 SQ Prod: 0x1, usecs since last trans: 21424000
	mlx5_core 0000:af:00.0 enp175s0f0: TX timeout on queue: 42, SQ: 0x12688, CQ: 0x5b5, SQ Cons: 0x1 SQ Prod: 0x1, usecs since last trans: 15919000

whereas with the proposed fix, timeout handler is always called at
appropriate time set into the watchdog by the driver as shown below:

	mlx5_core 0000:af:00.0 enp175s0f0: TX timeout on queue: 42, SQ: 0x122d6, CQ: 0x4af, SQ Cons: 0x4 SQ Prod: 0x4, usecs since last trans: 15137000
	mlx5_core 0000:af:00.0 enp175s0f0: TX timeout on queue: 33, SQ: 0x1410f, CQ: 0x48d, SQ Cons: 0xb SQ Prod: 0xb, usecs since last trans: 15568000
	mlx5_core 0000:af:00.0 enp175s0f0: TX timeout on queue: 33, SQ: 0x1424a, CQ: 0x599, SQ Cons: 0x8 SQ Prod: 0x8, usecs since last trans: 15539000

Signed-off-by: Praveen Kumar Kannoju <praveen.kannoju@oracle.com>
---
 net/sched/sch_generic.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

Comments

Jakub Kicinski May 1, 2024, 10:12 p.m. UTC | #1
On Tue, 30 Apr 2024 19:30:10 +0530 Praveen Kumar Kannoju wrote:
> Applications are sensitive to long network latency, particularly
> heartbeat monitoring ones. Longer the tx timeout recovery higher the
> risk with such applications on a production machines. This patch
> remedies, yet honoring device set tx timeout.
> 
> Modify watchdog next timeout to be shorter than the device specified.
> Compute the next timeout be equal to device watchdog timeout less the
> how long ago queue stop had been done. At next watchdog timeout tx
> timeout handler is called into if still in stopped state. Either called
> or not called, restore the watchdog timeout back to device specified.

Idea makes sense, some comments on the code below.

> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 4a2c763e2d11..64e31f8b4ac1 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -506,18 +506,25 @@ static void dev_watchdog(struct timer_list *t)
>  			unsigned int timedout_ms = 0;
>  			unsigned int i;
>  			unsigned long trans_start;
> +			unsigned long next_check = 0;
> +			unsigned long current_jiffies;
>  
>  			for (i = 0; i < dev->num_tx_queues; i++) {
>  				struct netdev_queue *txq;
> +				current_jiffies = jiffies;

Not sure why you save current jiffies.

>  				txq = netdev_get_tx_queue(dev, i);
>  				trans_start = READ_ONCE(txq->trans_start);
> -				if (netif_xmit_stopped(txq) &&
> -				    time_after(jiffies, (trans_start +
> -							 dev->watchdog_timeo))) {
> -					timedout_ms = jiffies_to_msecs(jiffies - trans_start);
> -					atomic_long_inc(&txq->trans_timeout);
> -					break;
> +				if (netif_xmit_stopped(txq)) {

please use continue instead of adding another indentation level

> +					if (time_after(current_jiffies, (trans_start +

wrap at 80 characters

> +								   dev->watchdog_timeo))) {
> +						timedout_ms = jiffies_to_msecs(current_jiffies -
> +										trans_start);
> +						atomic_long_inc(&txq->trans_timeout);
> +						break;
> +					}
> +					next_check = trans_start + dev->watchdog_timeo -
> +									current_jiffies;

this will give us "next_check" for last queue. Let's instead find the
oldest trans_start in the loop. Do:

		unsigned long oldest_start = jiffies;

then in the loop:

		oldest_start = min(...)

>  				}
>  			}
>  
> @@ -530,9 +537,11 @@ static void dev_watchdog(struct timer_list *t)
>  				dev->netdev_ops->ndo_tx_timeout(dev, i);
>  				netif_unfreeze_queues(dev);
>  			}
> +			if (!next_check)
> +				next_check = dev->watchdog_timeo;
>  			if (!mod_timer(&dev->watchdog_timer,
>  				       round_jiffies(jiffies +
> -						     dev->watchdog_timeo)))
> +						     next_check)))

then here you just need to swap jiffies for oldest_start

>  				release = false;
>  		}
>  	}
Praveen Kumar Kannoju May 3, 2024, 2:28 p.m. UTC | #2
Hi Jakub,
Thank you very much for the review and positive feedback.
We've incorporated your review comments except, using "continue" instead of adding another indentation level.  
Request you to elaborate your comment on it.

-
Praveen.

> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: 02 May 2024 03:43 AM
> To: Praveen Kannoju <praveen.kannoju@oracle.com>
> Cc: jhs@mojatatu.com; xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; netdev@vger.kernel.org; linux-
> kernel@vger.kernel.org; Rajesh Sivaramasubramaniom <rajesh.sivaramasubramaniom@oracle.com>; Rama Nichanamatlu
> <rama.nichanamatlu@oracle.com>; Manjunath Patil <manjunath.b.patil@oracle.com>
> Subject: Re: [PATCH RFC] net/sched: adjust device watchdog timer to detect stopped queue at right time
> 
> On Tue, 30 Apr 2024 19:30:10 +0530 Praveen Kumar Kannoju wrote:
> > Applications are sensitive to long network latency, particularly
> > heartbeat monitoring ones. Longer the tx timeout recovery higher the
> > risk with such applications on a production machines. This patch
> > remedies, yet honoring device set tx timeout.
> >
> > Modify watchdog next timeout to be shorter than the device specified.
> > Compute the next timeout be equal to device watchdog timeout less the
> > how long ago queue stop had been done. At next watchdog timeout tx
> > timeout handler is called into if still in stopped state. Either
> > called or not called, restore the watchdog timeout back to device specified.
> 
> Idea makes sense, some comments on the code below.
> 
> > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index
> > 4a2c763e2d11..64e31f8b4ac1 100644
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -506,18 +506,25 @@ static void dev_watchdog(struct timer_list *t)
> >  			unsigned int timedout_ms = 0;
> >  			unsigned int i;
> >  			unsigned long trans_start;
> > +			unsigned long next_check = 0;
> > +			unsigned long current_jiffies;
> >
> >  			for (i = 0; i < dev->num_tx_queues; i++) {
> >  				struct netdev_queue *txq;
> > +				current_jiffies = jiffies;
> 
> Not sure why you save current jiffies.
> 
> >  				txq = netdev_get_tx_queue(dev, i);
> >  				trans_start = READ_ONCE(txq->trans_start);
> > -				if (netif_xmit_stopped(txq) &&
> > -				    time_after(jiffies, (trans_start +
> > -							 dev->watchdog_timeo))) {
> > -					timedout_ms = jiffies_to_msecs(jiffies - trans_start);
> > -					atomic_long_inc(&txq->trans_timeout);
> > -					break;
> > +				if (netif_xmit_stopped(txq)) {
> 
> please use continue instead of adding another indentation level

We need to take decision on whether to break out of loop or modify "oldest_start" only when
Queue is stopped. Hence one more level of indentation is needed. Can you please elaborate
on using "continue" in existing condition instead of adding a new indentation level.

> 
> > +					if (time_after(current_jiffies, (trans_start +
> 
> wrap at 80 characters
> 
> > +								   dev->watchdog_timeo))) {
> > +						timedout_ms = jiffies_to_msecs(current_jiffies -
> > +										trans_start);
> > +						atomic_long_inc(&txq->trans_timeout);
> > +						break;
> > +					}
> > +					next_check = trans_start + dev->watchdog_timeo -
> > +									current_jiffies;
> 
> this will give us "next_check" for last queue. Let's instead find the oldest trans_start in the loop. Do:
> 
> 		unsigned long oldest_start = jiffies;
> 
> then in the loop:
> 
> 		oldest_start = min(...)
> 
> >  				}
> >  			}
> >
> > @@ -530,9 +537,11 @@ static void dev_watchdog(struct timer_list *t)
> >  				dev->netdev_ops->ndo_tx_timeout(dev, i);
> >  				netif_unfreeze_queues(dev);
> >  			}
> > +			if (!next_check)
> > +				next_check = dev->watchdog_timeo;
> >  			if (!mod_timer(&dev->watchdog_timer,
> >  				       round_jiffies(jiffies +
> > -						     dev->watchdog_timeo)))
> > +						     next_check)))
> 
> then here you just need to swap jiffies for oldest_start
> 
> >  				release = false;
> >  		}
> >  	}
Jakub Kicinski May 3, 2024, 7:29 p.m. UTC | #3
On Fri, 3 May 2024 14:28:13 +0000 Praveen Kannoju wrote:
> > >  				txq = netdev_get_tx_queue(dev, i);
> > >  				trans_start = READ_ONCE(txq->trans_start);
> > > -				if (netif_xmit_stopped(txq) &&
> > > -				    time_after(jiffies, (trans_start +
> > > -							 dev->watchdog_timeo))) {
> > > -					timedout_ms = jiffies_to_msecs(jiffies - trans_start);
> > > -					atomic_long_inc(&txq->trans_timeout);
> > > -					break;
> > > +				if (netif_xmit_stopped(txq)) {  
> > 
> > please use continue instead of adding another indentation level  
> 
> We need to take decision on whether to break out of loop or modify "oldest_start" only when
> Queue is stopped. Hence one more level of indentation is needed. Can you please elaborate
> on using "continue" in existing condition instead of adding a new indentation level.

If the queue is not stopped, continue. Split the condition into
multiple ifs.

> > > +								   dev->watchdog_timeo))) {
> > > +						timedout_ms = jiffies_to_msecs(current_jiffies -
> > > +										trans_start);
> > > +						atomic_long_inc(&txq->trans_timeout);
> > > +						break;
> > > +					}
> > > +					next_check = trans_start + dev->watchdog_timeo -
> > > +									current_jiffies;  
> > 
> > this will give us "next_check" for last queue. Let's instead find the oldest trans_start in the loop. Do:
> > 
> > 		unsigned long oldest_start = jiffies;
> > 
> > then in the loop:
> > 
> > 		oldest_start = min(...)

BTW, the min() I suggested here needs to be a if (time_after(...)),
we can't use bare min() to compare jiffies, because they may wrap.
Praveen Kumar Kannoju May 6, 2024, 2:02 p.m. UTC | #4
Thank you, Jakub.
Your comments have been addressed in the v2 patch. We've tested it internally and the patch works as expected. Please review and let us know If any additional changes are needed. 

-
Praveen.

> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: 04 May 2024 01:00 AM
> To: Praveen Kannoju <praveen.kannoju@oracle.com>
> Cc: jhs@mojatatu.com; xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; netdev@vger.kernel.org; linux-
> kernel@vger.kernel.org; Rajesh Sivaramasubramaniom <rajesh.sivaramasubramaniom@oracle.com>; Rama Nichanamatlu
> <rama.nichanamatlu@oracle.com>; Manjunath Patil <manjunath.b.patil@oracle.com>
> Subject: Re: [PATCH RFC] net/sched: adjust device watchdog timer to detect stopped queue at right time
> 
> On Fri, 3 May 2024 14:28:13 +0000 Praveen Kannoju wrote:
> > > >  				txq = netdev_get_tx_queue(dev, i);
> > > >  				trans_start = READ_ONCE(txq->trans_start);
> > > > -				if (netif_xmit_stopped(txq) &&
> > > > -				    time_after(jiffies, (trans_start +
> > > > -							 dev->watchdog_timeo))) {
> > > > -					timedout_ms = jiffies_to_msecs(jiffies - trans_start);
> > > > -					atomic_long_inc(&txq->trans_timeout);
> > > > -					break;
> > > > +				if (netif_xmit_stopped(txq)) {
> > >
> > > please use continue instead of adding another indentation level
> >
> > We need to take decision on whether to break out of loop or modify
> > "oldest_start" only when Queue is stopped. Hence one more level of
> > indentation is needed. Can you please elaborate on using "continue" in existing condition instead of adding a new indentation level.
> 
> If the queue is not stopped, continue. Split the condition into multiple ifs.
> 
> > > > +								   dev->watchdog_timeo))) {
> > > > +						timedout_ms = jiffies_to_msecs(current_jiffies -
> > > > +										trans_start);
> > > > +						atomic_long_inc(&txq->trans_timeout);
> > > > +						break;
> > > > +					}
> > > > +					next_check = trans_start + dev->watchdog_timeo -
> > > > +									current_jiffies;
> > >
> > > this will give us "next_check" for last queue. Let's instead find the oldest trans_start in the loop. Do:
> > >
> > > 		unsigned long oldest_start = jiffies;
> > >
> > > then in the loop:
> > >
> > > 		oldest_start = min(...)
> 
> BTW, the min() I suggested here needs to be a if (time_after(...)), we can't use bare min() to compare jiffies, because they may wrap.
diff mbox series

Patch

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 4a2c763e2d11..64e31f8b4ac1 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -506,18 +506,25 @@  static void dev_watchdog(struct timer_list *t)
 			unsigned int timedout_ms = 0;
 			unsigned int i;
 			unsigned long trans_start;
+			unsigned long next_check = 0;
+			unsigned long current_jiffies;
 
 			for (i = 0; i < dev->num_tx_queues; i++) {
 				struct netdev_queue *txq;
+				current_jiffies = jiffies;
 
 				txq = netdev_get_tx_queue(dev, i);
 				trans_start = READ_ONCE(txq->trans_start);
-				if (netif_xmit_stopped(txq) &&
-				    time_after(jiffies, (trans_start +
-							 dev->watchdog_timeo))) {
-					timedout_ms = jiffies_to_msecs(jiffies - trans_start);
-					atomic_long_inc(&txq->trans_timeout);
-					break;
+				if (netif_xmit_stopped(txq)) {
+					if (time_after(current_jiffies, (trans_start +
+								   dev->watchdog_timeo))) {
+						timedout_ms = jiffies_to_msecs(current_jiffies -
+										trans_start);
+						atomic_long_inc(&txq->trans_timeout);
+						break;
+					}
+					next_check = trans_start + dev->watchdog_timeo -
+									current_jiffies;
 				}
 			}
 
@@ -530,9 +537,11 @@  static void dev_watchdog(struct timer_list *t)
 				dev->netdev_ops->ndo_tx_timeout(dev, i);
 				netif_unfreeze_queues(dev);
 			}
+			if (!next_check)
+				next_check = dev->watchdog_timeo;
 			if (!mod_timer(&dev->watchdog_timer,
 				       round_jiffies(jiffies +
-						     dev->watchdog_timeo)))
+						     next_check)))
 				release = false;
 		}
 	}