[iwl-net] idpf: extend tx watchdog timeout

Message ID	20240603184714.3697911-1-joshua.a.hay@intel.com (mailing list archive)
State	Awaiting Upstream
Delegated to:	Netdev Maintainers
Headers	Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.10]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 985DD13AD20 for <netdev@vger.kernel.org>; Mon, 3 Jun 2024 18:45:14 +0000 (UTC)
	From: Joshua Hay <joshua.a.hay@intel.com>
	To: intel-wired-lan@lists.osuosl.org
	Cc: netdev@vger.kernel.org, Joshua Hay <joshua.a.hay@intel.com>, Sridhar Samudrala <sridhar.samudrala@intel.com>
	Subject: [PATCH iwl-net] idpf: extend tx watchdog timeout
	Date: Mon, 3 Jun 2024 11:47:14 -0700
	Message-Id: <20240603184714.3697911-1-joshua.a.hay@intel.com>
	Precedence: bulk
	MIME-Version: 1.0
	Content-Transfer-Encoding: 8bit
Series	[iwl-net] idpf: extend tx watchdog timeout

Context	Check	Description
netdev/series_format	success	Single patches do not need cover letters
netdev/tree_selection	success	Clearly marked for net
netdev/ynl	success	Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present	success	Fixes tag present in non-next series
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 901 this patch: 901
netdev/build_tools	success	No tools touched, skip
netdev/cc_maintainers	fail	2 blamed authors not CCed: madhu.chittim@intel.com anthony.l.nguyen@intel.com; 6 maintainers not CCed: pabeni@redhat.com edumazet@google.com madhu.chittim@intel.com kuba@kernel.org anthony.l.nguyen@intel.com jesse.brandeburg@intel.com
netdev/build_clang	success	Errors and warnings before: 905 this patch: 905
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/deprecated_api	success	None detected
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	Fixes tag looks correct
netdev/build_allmodconfig_warn	success	Errors and warnings before: 905 this patch: 905
netdev/checkpatch	success	total: 0 errors, 0 warnings, 0 checks, 10 lines checked
netdev/build_clang_rust	success	No Rust files in patch. Skipping build
netdev/kdoc	success	Errors and warnings before: 35 this patch: 35
netdev/source_inline	success	Was 0 now: 0

Josh Hay June 3, 2024, 6:47 p.m. UTC

There are several reasons for a TX completion to take longer than usual
to be written back by HW. For example, the completion for a packet that
misses a rule will have increased latency. The side effect of these
variable latencies for any given packet is out of order completions. The
stack sends packet X and Y. If packet X takes longer because of the rule
miss in the example above, but packet Y hits, it can go on the wire
immediately. Which also means it can be completed first.  The driver
will then receive a completion for packet Y before packet X.  The driver
will stash the buffers for packet X in a hash table to allow the tx send
queue descriptors for both packet X and Y to be reused. The driver will
receive the completion for packet X sometime later and have to search
the hash table for the associated packet.

The driver cleans packets directly on the ring first, i.e. not out of
order completions since they are to some extent considered "slow(er)
path". However, certain workloads can increase the frequency of out of
order completions thus introducing even more latency into the cleaning
path. Bump up the timeout value to account for these workloads.

Fixes: 0fe45467a104 ("idpf: add create vport and netdev configuration")
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
---
 drivers/net/ethernet/intel/idpf/idpf_lib.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
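
For readers unfamiliar with the stash mechanism described in the commit
message, here is a minimal userspace sketch of the idea: buffers whose
completion is still pending are parked in a hash table keyed by a
completion tag, so the ring descriptors can be reused, and a late
completion looks its buffer up by tag. This is not the idpf
implementation. The names (tx_buf, tx_stash, stash_add, stash_take), the
16-bit tag, and the bucket count are all assumptions made for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STASH_BUCKETS 16u	/* hypothetical table size, power of two */

/* Hypothetical stand-in for a TX buffer awaiting its completion. */
struct tx_buf {
	uint16_t compl_tag;	/* tag the HW echoes back in the completion */
	struct tx_buf *next;	/* chain within one hash bucket */
};

struct tx_stash {
	struct tx_buf *bucket[STASH_BUCKETS];
};

static unsigned int stash_hash(uint16_t tag)
{
	return tag & (STASH_BUCKETS - 1u);
}

/* Park a buffer whose completion has not arrived, freeing its ring slot. */
static void stash_add(struct tx_stash *s, struct tx_buf *buf)
{
	unsigned int b;

	assert(s != NULL && buf != NULL);
	b = stash_hash(buf->compl_tag);
	buf->next = s->bucket[b];
	s->bucket[b] = buf;
}

/* Late completion: find and unlink the buffer for this tag, or NULL. */
static struct tx_buf *stash_take(struct tx_stash *s, uint16_t tag)
{
	struct tx_buf **pp = &s->bucket[stash_hash(tag)];

	for (; *pp; pp = &(*pp)->next) {
		if ((*pp)->compl_tag == tag) {
			struct tx_buf *found = *pp;

			*pp = found->next;
			found->next = NULL;
			return found;
		}
	}
	return NULL;
}
```

The lookup in stash_take() is the extra work on the late-completion path
that the commit message refers to: the more packets complete out of
order, the longer the bucket chains get and the more often this search
runs.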

Jacob Keller June 3, 2024, 9:13 p.m. UTC | #1

On 6/3/2024 11:47 AM, Joshua Hay wrote:
> There are several reasons for a TX completion to take longer than usual
> to be written back by HW. For example, the completion for a packet that
> misses a rule will have increased latency. The side effect of these
> variable latencies for any given packet is out of order completions. The
> stack sends packet X and Y. If packet X takes longer because of the rule
> miss in the example above, but packet Y hits, it can go on the wire
> immediately. Which also means it can be completed first.  The driver
> will then receive a completion for packet Y before packet X.  The driver
> will stash the buffers for packet X in a hash table to allow the tx send
> queue descriptors for both packet X and Y to be reused. The driver will
> receive the completion for packet X sometime later and have to search
> the hash table for the associated packet.
> 
> The driver cleans packets directly on the ring first, i.e. not out of
> order completions since they are to some extent considered "slow(er)
> path". However, certain workloads can increase the frequency of out of
> order completions thus introducing even more latency into the cleaning
> path. Bump up the timeout value to account for these workloads.
> 
> Fixes: 0fe45467a104 ("idpf: add create vport and netdev configuration")
> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
> ---

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

Nelson, Shannon June 4, 2024, 12:33 a.m. UTC | #2

On 6/3/2024 11:47 AM, Joshua Hay wrote:
> 
> There are several reasons for a TX completion to take longer than usual
> to be written back by HW. For example, the completion for a packet that
> misses a rule will have increased latency. The side effect of these
> variable latencies for any given packet is out of order completions. The
> stack sends packet X and Y. If packet X takes longer because of the rule
> miss in the example above, but packet Y hits, it can go on the wire
> immediately. Which also means it can be completed first.  The driver
> will then receive a completion for packet Y before packet X.  The driver
> will stash the buffers for packet X in a hash table to allow the tx send
> queue descriptors for both packet X and Y to be reused. The driver will
> receive the completion for packet X sometime later and have to search
> the hash table for the associated packet.
> 
> The driver cleans packets directly on the ring first, i.e. not out of
> order completions since they are to some extent considered "slow(er)
> path". However, certain workloads can increase the frequency of out of
> order completions thus introducing even more latency into the cleaning
> path. Bump up the timeout value to account for these workloads.
> 
> Fixes: 0fe45467a104 ("idpf: add create vport and netdev configuration")
> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
> ---
>   drivers/net/ethernet/intel/idpf/idpf_lib.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_lib.c b/drivers/net/ethernet/intel/idpf/idpf_lib.c
> index f1ee5584e8fa..3d4ae2ed9b96 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_lib.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_lib.c
> @@ -770,8 +770,8 @@ static int idpf_cfg_netdev(struct idpf_vport *vport)
>          else
>                  netdev->netdev_ops = &idpf_netdev_ops_singleq;
> 
> -       /* setup watchdog timeout value to be 5 second */
> -       netdev->watchdog_timeo = 5 * HZ;
> +       /* setup watchdog timeout value to be 30 seconds */
> +       netdev->watchdog_timeo = 30 * HZ;

Huh... that's a pretty big number.  If it really needs to be that big I 
wonder if there's something else that needs attention.

sln


> 
>          netdev->dev_port = idx;
> 
> --
> 2.39.2
> 
>

David Decotigny June 4, 2024, 11:34 p.m. UTC | #3

On 6/3/2024 11:47 AM, Joshua Hay wrote:
>
> There are several reasons for a TX completion to take longer than usual
> to be written back by HW. For example, the completion for a packet that
> misses a rule will have increased latency. The side effect of these
> variable latencies for any given packet is out of order completions. The
> stack sends packet X and Y. If packet X takes longer because of the rule
> miss in the example above, but packet Y hits, it can go on the wire
> immediately. Which also means it can be completed first.  The driver
> will then receive a completion for packet Y before packet X.  The driver
> will stash the buffers for packet X in a hash table to allow the tx send
> queue descriptors for both packet X and Y to be reused. The driver will
> receive the completion for packet X sometime later and have to search
> the hash table for the associated packet.
>
> The driver cleans packets directly on the ring first, i.e. not out of
> order completions since they are to some extent considered "slow(er)
> path". However, certain workloads can increase the frequency of out of
> order completions thus introducing even more latency into the cleaning
> path. Bump up the timeout value to account for these workloads.
>
> Fixes: 0fe45467a104 ("idpf: add create vport and netdev configuration")
> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
> ---
>   drivers/net/ethernet/intel/idpf/idpf_lib.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)


We tested this patch with our intensive high-performance workloads that
have been able to reproduce the issue, and it was sufficient to avoid tx
timeouts. We also noticed that these longer timeouts are not unusual in
the smartnic space: we see 100s or 50s timeouts for a few NICs, and
other NICs receive this timeout as a hint from the fw.

Reviewed-by: David Decotigny <ddecotig@google.com>

Alexander Lobakin June 11, 2024, 10:44 a.m. UTC | #4

From: David Decotigny <ddecotig@gmail.com>
Date: Tue, 4 Jun 2024 16:34:48 -0700

> 
> 
> On 6/3/2024 11:47 AM, Joshua Hay wrote:
>>
>> There are several reasons for a TX completion to take longer than usual
>> to be written back by HW. For example, the completion for a packet that
>> misses a rule will have increased latency. The side effect of these
>> variable latencies for any given packet is out of order completions. The
>> stack sends packet X and Y. If packet X takes longer because of the rule
>> miss in the example above, but packet Y hits, it can go on the wire
>> immediately. Which also means it can be completed first.  The driver
>> will then receive a completion for packet Y before packet X.  The driver
>> will stash the buffers for packet X in a hash table to allow the tx send
>> queue descriptors for both packet X and Y to be reused. The driver will
>> receive the completion for packet X sometime later and have to search
>> the hash table for the associated packet.
>>
>> The driver cleans packets directly on the ring first, i.e. not out of
>> order completions since they are to some extent considered "slow(er)
>> path". However, certain workloads can increase the frequency of out of
>> order completions thus introducing even more latency into the cleaning
>> path. Bump up the timeout value to account for these workloads.
>>
>> Fixes: 0fe45467a104 ("idpf: add create vport and netdev configuration")
>> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
>> ---
>>   drivers/net/ethernet/intel/idpf/idpf_lib.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> 
> We tested this patch with our intensive high-performance workloads that
> have been able to reproduce the issue, and it was sufficient to avoid tx
> timeouts. We also noticed that these longer timeouts are not unusual in
> the smartnic space: we see 100s or 50s timeouts for a few NICs, and

Example?

> other NICs receive this timeout as a hint from the fw.
> 
> Reviewed-by: David Decotigny <ddecotig@google.com>

We either need to teach watchdog core to account OOOs or there's
something really wrong in the driver. For example, how can we then
distinguish if > 5 sec delay happened just due to an OOO or due to that
something went bad with the HW?

Note that there are several patches fixing Tx (incl. timeouts) in my
tree, including yours (Joshua's) which you somehow didn't send yet ._.
Maybe start from them first?

I don't buy 30 seconds, at least for now. Maybe I'm missing something.

Nacked-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Thanks,
Olek

Josh Hay June 11, 2024, 6:13 p.m. UTC | #5

On 6/11/2024 3:44 AM, Alexander Lobakin wrote:
> From: David Decotigny <ddecotig@gmail.com>
> Date: Tue, 4 Jun 2024 16:34:48 -0700
> 
>>
>>
>> On 6/3/2024 11:47 AM, Joshua Hay wrote:
>>>
>>> There are several reasons for a TX completion to take longer than usual
>>> to be written back by HW. For example, the completion for a packet that
>>> misses a rule will have increased latency. The side effect of these
>>> variable latencies for any given packet is out of order completions. The
>>> stack sends packet X and Y. If packet X takes longer because of the rule
>>> miss in the example above, but packet Y hits, it can go on the wire
>>> immediately. Which also means it can be completed first.  The driver
>>> will then receive a completion for packet Y before packet X.  The driver
>>> will stash the buffers for packet X in a hash table to allow the tx send
>>> queue descriptors for both packet X and Y to be reused. The driver will
>>> receive the completion for packet X sometime later and have to search
>>> the hash table for the associated packet.
>>>
>>> The driver cleans packets directly on the ring first, i.e. not out of
>>> order completions since they are to some extent considered "slow(er)
>>> path". However, certain workloads can increase the frequency of out of
>>> order completions thus introducing even more latency into the cleaning
>>> path. Bump up the timeout value to account for these workloads.
>>>
>>> Fixes: 0fe45467a104 ("idpf: add create vport and netdev configuration")
>>> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
>>> ---
>>>    drivers/net/ethernet/intel/idpf/idpf_lib.c | 4 ++--
>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>
>>
>> We tested this patch with our intensive high-performance workloads that
>> have been able to reproduce the issue, and it was sufficient to avoid tx
>> timeouts. We also noticed that these longer timeouts are not unusual in
>> the smartnic space: we see 100s or 50s timeouts for a few NICs, and
> 
> Example?
> 
>> other NICs receive this timeout as a hint from the fw.
>>
>> Reviewed-by: David Decotigny <ddecotig@google.com>
> 
> We either need to teach watchdog core to account OOOs or there's
> something really wrong in the driver. For example, how can we then
> distinguish if > 5 sec delay happened just due to an OOO or due to that
> something went bad with the HW?
> 
> Note that there are several patches fixing Tx (incl. timeouts) in my
> tree, including yours (Joshua's) which you somehow didn't send yet ._.
> Maybe start from them first?

I believe it was you who specifically asked our team to defer pushing 
any upstream patches while you were working on the XDP series "to avoid 
having to rebase", which was a reasonable request at the time. We also 
had no reason to believe the existing upstream idpf implementation was 
experiencing timeouts (it is being tested by numerous validation teams). 
So there was no urgency to get those patches upstream. Which patches in 
your tree do you believe fix specific timeout situations? It appears you 
pulled in some of the changes from the out-of-tree driver, but those 
were all enhancements. It wasn't until the workload that David mentioned 
was run on the current driver that we had any indication there were 
timeout issues.

> 
> I don't buy 30 seconds, at least for now. Maybe I'm missing something.
> 
> Nacked-by: Alexander Lobakin <aleksander.lobakin@intel.com>

In the process of debugging the newly discovered timeout, our 
architecture team made it clear that 5 seconds is too low for this type 
of device, with a non deterministic pipeline where packets can take a 
number of exception/slow paths. Admittedly, we don't know the exact 
number, so the solution for the time being was to bump it up with a 
comfortable buffer. As we tune things and debug with various workloads, 
we can bring it back down. As David mentioned, there is precedent for an 
extended timeout for smartnics. Why is it suddenly unacceptable for 
Intel's device?

> 
> Thanks,
> Olek

Thanks,
Josh

Alexander Lobakin June 12, 2024, 9:34 a.m. UTC | #6

From: Josh Hay <joshua.a.hay@intel.com>
Date: Tue, 11 Jun 2024 11:13:53 -0700

> 
> 
> On 6/11/2024 3:44 AM, Alexander Lobakin wrote:
>> From: David Decotigny <ddecotig@gmail.com>
>> Date: Tue, 4 Jun 2024 16:34:48 -0700

[...]

>> Note that there are several patches fixing Tx (incl. timeouts) in my
>> tree, including yours (Joshua's) which you somehow didn't send yet ._.
>> Maybe start from them first?
> 
> I believe it was you who specifically asked our team to defer pushing
> any upstream patches while you were working on the XDP series "to avoid
> having to rebase", which was a reasonable request at the time. We also

It was only related to the virtchnl refactoring and later I cancelled
that when I realized it will go earlier than our series.

> had no reason to believe the existing upstream idpf implementation was
> experiencing timeouts (it is being tested by numerous validation teams).
> So there was no urgency to get those patches upstream. Which patches in
> your tree do you believe fix specific timeout situations? It appears you

[0][1][2]

> pulled in some of the changes from the out-of-tree driver, but those
> were all enhancements. It wasn't until the workload that David mentioned

No, these are all fixes.

[0] is yours from the OOT, extended.
[1] is mine and never was in the OOT.
[2] is yours from the OOT, extended by Michał.

They really do help.

Note that there's one more Tx timeout patch from you in the OOT, but it
actually broke Tx xD

> was run on the current driver that we had any indication there were
> timeout issues.
> 
>>
>> I don't buy 30 seconds, at least for now. Maybe I'm missing something.
>>
>> Nacked-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> 
> 
> In the process of debugging the newly discovered timeout, our
> architecture team made it clear that 5 seconds is too low for this type
> of device, with a non deterministic pipeline where packets can take a
> number of exception/slow paths. Admittedly, we don't know the exact

Slowpath which takes 30 seconds to complete, seriously?

> number, so the solution for the time being was to bump it up with a
> comfortable buffer. As we tune things and debug with various workloads,
> we can bring it back down. As David mentioned, there is precedent for an
> extended timeout for smartnics. Why is it suddenly unacceptable for
> Intel's device?

I don't know where this "suddenly" comes from.
Because even 5 seconds is too much.
HW usually send packets in microseconds if not faster. Extending the
timeout will hide real issues and make debugging more difficult.

I suspect this all is for OOO packets with an explicit sending timestamp
passed from the userspace, but as I said, you need to teach the kernel
watchdog to account them.
Otherwise, I can ask the driver to send a packet in 31 seconds, what
then, there will be a timeout and you will send a patch to extend it to
60 seconds?

> 
>>
>> Thanks,
>> Olek
> 
> Thanks,
> Josh

[0] https://github.com/alobakin/linux/commit/aad547037598
[1] https://github.com/alobakin/linux/commit/50f4c27ba13e
[2] https://github.com/alobakin/linux/commit/4a9b6c5d0ee8

Thanks,
Olek

David Decotigny June 12, 2024, 6:01 p.m. UTC | #7

On Tue, Jun 11, 2024 at 11:13 AM Josh Hay <joshua.a.hay@intel.com> wrote:
>
>
>
> On 6/11/2024 3:44 AM, Alexander Lobakin wrote:
> > From: David Decotigny <ddecotig@gmail.com>
> > Date: Tue, 4 Jun 2024 16:34:48 -0700
> >
> >>
> >>
> >> On 6/3/2024 11:47 AM, Joshua Hay wrote:
> >>>
> >>> There are several reasons for a TX completion to take longer than usual
> >>> to be written back by HW. For example, the completion for a packet that
> >>> misses a rule will have increased latency. The side effect of these
> >>> variable latencies for any given packet is out of order completions. The
> >>> stack sends packet X and Y. If packet X takes longer because of the rule
> >>> miss in the example above, but packet Y hits, it can go on the wire
> >>> immediately. Which also means it can be completed first.  The driver
> >>> will then receive a completion for packet Y before packet X.  The driver
> >>> will stash the buffers for packet X in a hash table to allow the tx send
> >>> queue descriptors for both packet X and Y to be reused. The driver will
> >>> receive the completion for packet X sometime later and have to search
> >>> the hash table for the associated packet.
> >>>
> >>> The driver cleans packets directly on the ring first, i.e. not out of
> >>> order completions since they are to some extent considered "slow(er)
> >>> path". However, certain workloads can increase the frequency of out of
> >>> order completions thus introducing even more latency into the cleaning
> >>> path. Bump up the timeout value to account for these workloads.
> >>>
> >>> Fixes: 0fe45467a104 ("idpf: add create vport and netdev configuration")
> >>> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >>> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
> >>> ---
> >>>    drivers/net/ethernet/intel/idpf/idpf_lib.c | 4 ++--
> >>>    1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >>
> >> We tested this patch with our intensive high-performance workloads that
> >> have been able to reproduce the issue, and it was sufficient to avoid tx
> >> timeouts. We also noticed that these longer timeouts are not unusual in
> >> the smartnic space: we see 100s or 50s timeouts for a few NICs, and
> >
> > Example?

a sample:

drivers/net/ethernet/cavium/thunder/nic.h:#define
NICVF_TX_TIMEOUT                (50 * HZ)
drivers/net/ethernet/cavium/thunder/nicvf_main.c:
netdev->watchdog_timeo = NICVF_TX_TIMEOUT;
drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.h:#define
OTX2_TX_TIMEOUT                (100 * HZ)
drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c:
netdev->watchdog_timeo = OTX2_TX_TIMEOUT;
drivers/net/ethernet/marvell/octeontx2/nic/otx2_vf.c:
netdev->watchdog_timeo = OTX2_TX_TIMEOUT;
drivers/net/ethernet/amazon/ena/ena_netdev.c:
netdev->watchdog_timeo = msecs_to_jiffies(hints->netdev_wd_timeout);

> >
> >> other NICs receive this timeout as a hint from the fw.
> >>
> >> Reviewed-by: David Decotigny <ddecotig@google.com>
> >
> > We either need to teach watchdog core to account OOOs or there's
> > something really wrong in the driver. For example, how can we then
> > distinguish if > 5 sec delay happened just due to an OOO or due to that
> > something went bad with the HW?
> >
> > Note that there are several patches fixing Tx (incl. timeouts) in my
> > tree, including yours (Joshua's) which you somehow didn't send yet ._.
> > Maybe start from them first?
>
> I believe it was you who specifically asked our team to defer pushing
> any upstream patches while you were working on the XDP series "to avoid
> having to rebase", which was a reasonable request at the time. We also
> had no reason to believe the existing upstream idpf implementation was
> experiencing timeouts (it is being tested by numerous validation teams).
> So there was no urgency to get those patches upstream. Which patches in
> your tree do you believe fix specific timeout situations? It appears you
> pulled in some of the changes from the out-of-tree driver, but those
> were all enhancements. It wasn't until the workload that David mentioned
> was run on the current driver that we had any indication there were
> timeout issues.

Some issues were observed with high cpu/memory/network utilization
workloads such as a SAP HANA benchmark.

>
> >
> > I don't buy 30 seconds, at least for now. Maybe I'm missing something.
> >
> > Nacked-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>
>
> In the process of debugging the newly discovered timeout, our
> architecture team made it clear that 5 seconds is too low for this type
> of device, with a non deterministic pipeline where packets can take a
> number of exception/slow paths. Admittedly, we don't know the exact
> number, so the solution for the time being was to bump it up with a
> comfortable buffer. As we tune things and debug with various workloads,
> we can bring it back down. As David mentioned, there is precedent for an
> extended timeout for smartnics. Why is it suddenly unacceptable for
> Intel's device?
>
> >
> > Thanks,
> > Olek
>
> Thanks,
> Josh
--
David Decotigny

Josh Hay June 13, 2024, 6:36 a.m. UTC | #8

On 6/12/2024 2:34 AM, Alexander Lobakin wrote:
> From: Josh Hay <joshua.a.hay@intel.com>
> Date: Tue, 11 Jun 2024 11:13:53 -0700
> 
>>
>>
>> On 6/11/2024 3:44 AM, Alexander Lobakin wrote:
>>> From: David Decotigny <ddecotig@gmail.com>
>>> Date: Tue, 4 Jun 2024 16:34:48 -0700
> 
> [...]
> 
>>> Note that there are several patches fixing Tx (incl. timeouts) in my
>>> tree, including yours (Joshua's) which you somehow didn't send yet ._.
>>> Maybe start from them first?
>>
>> I believe it was you who specifically asked our team to defer pushing
>> any upstream patches while you were working on the XDP series "to avoid
>> having to rebase", which was a reasonable request at the time. We also
> 
> It was only related to the virtchnl refactoring and later I cancelled
> that when I realized it will go earlier than our series.
> 
>> had no reason to believe the existing upstream idpf implementation was
>> experiencing timeouts (it is being tested by numerous validation teams).
>> So there was no urgency to get those patches upstream. Which patches in
>> your tree do you believe fix specific timeout situations? It appears you
> 
> [0][1][2]
> 
>> pulled in some of the changes from the out-of-tree driver, but those
>> were all enhancements. It wasn't until the workload that David mentioned
> 
> No, these are all fixes.
> 
> [0] is yours from the OOT, extended.
> [1] is mine and never was in the OOT.
> [2] is yours from the OOT, extended by Michał.

My main point was since no other tx timeouts have been reported on the 
upstream driver, none of these seem like critical fixes. This particular 
timeout signature did not seem to match any of these in general. E.g. it 
would have been obvious if the timeout was because of what [0] fixes. 
It's also possible these timeouts did not manifest on the upstream 
driver because it is missing other OOT changes.

> 
> They really do help.

No disagreement there. I would've loved to push these changes sooner, 
but we already covered why that didn't happen.

> 
> Note that there's one more Tx timeout patch from you in the OOT, but it
> actually broke Tx xD

If you are implying that our team would commit code that is knowingly 
broken, that is absolutely not true. I believe what you're referring to 
is a change that introduced a tx timeout that also took a very specific 
workload to trigger it. That issue was fixed and not applicable to the 
current upstream implementation, so I do not see how that is relevant to 
this conversation.

> 
>> was run on the current driver that we had any indication there were
>> timeout issues.
>>
>>>
>>> I don't buy 30 seconds, at least for now. Maybe I'm missing something.
>>>
>>> Nacked-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>>
>>
>> In the process of debugging the newly discovered timeout, our
>> architecture team made it clear that 5 seconds is too low for this type
>> of device, with a non deterministic pipeline where packets can take a
>> number of exception/slow paths. Admittedly, we don't know the exact
> 
> Slowpath which takes 30 seconds to complete, seriously?

The architecture team said 5s is too low. 30s was chosen to give ample 
cushion to avoid changing the timeout should this situation come up again.

> 
>> number, so the solution for the time being was to bump it up with a
>> comfortable buffer. As we tune things and debug with various workloads,
>> we can bring it back down. As David mentioned, there is precedent for an
>> extended timeout for smartnics. Why is it suddenly unacceptable for
>> Intel's device?
> 
> I don't know where this "suddenly" comes from.
> Because even 5 seconds is too much.
> HW usually send packets in microseconds if not faster. Extending the
> timeout will hide real issues and make debugging more difficult.

Can you please elaborate on exactly why it would be more difficult? If 
something is actually wrong in HW, it seems unlikely extra time would 
correct it. If something is functionally wrong in the driver, why does 
it matter if it's 5s, 15s, or 30s? It will timeout just the same.

> 
> I suspect this all is for OOO packets with an explicit sending timestamp
> passed from the userspace, but as I said, you need to teach the kernel
> watchdog to account them.

Out of order completions can happen for numerous reasons, some of which 
the driver will know nothing about, i.e. the userspace timestamps are 
not the only things that trigger OOO completions. We can detect that 
we're processing completion B before A, but it's only at that time that 
we can tell the stack to _maybe_ expect a delayed completion. I'm open 
to discussing this further, but it does not seem like a simple solution 
that can be implemented in the immediate future.

> Otherwise, I can ask the driver to send a packet in 31 seconds, what
> then, there will be a timeout and you will send a patch to extend it to
> 60 seconds?

I hope there are checks in the stack itself that would not allow the 
packet to be scheduled beyond the timeout window :)

> 
>>
>>>
>>> Thanks,
>>> Olek
>>
>> Thanks,
>> Josh
> 
> [0] https://github.com/alobakin/linux/commit/aad547037598
> [1] https://github.com/alobakin/linux/commit/50f4c27ba13e
> [2] https://github.com/alobakin/linux/commit/4a9b6c5d0ee8
> 
> Thanks,
> Olek

Thanks,
Josh

Alexander Lobakin June 13, 2024, 9:49 a.m. UTC | #9

From: David Decotigny <ddecotig@google.com>
Date: Wed, 12 Jun 2024 11:01:46 -0700

> On Tue, Jun 11, 2024 at 11:13 AM Josh Hay <joshua.a.hay@intel.com> wrote:
>>
>>
>>
>> On 6/11/2024 3:44 AM, Alexander Lobakin wrote:
>>> From: David Decotigny <ddecotig@gmail.com>
>>> Date: Tue, 4 Jun 2024 16:34:48 -0700
>>>
>>>>
>>>>
>>>> On 6/3/2024 11:47 AM, Joshua Hay wrote:
>>>>>
>>>>> There are several reasons for a TX completion to take longer than usual
>>>>> to be written back by HW. For example, the completion for a packet that
>>>>> misses a rule will have increased latency. The side effect of these
>>>>> variable latencies for any given packet is out of order completions. The
>>>>> stack sends packet X and Y. If packet X takes longer because of the rule
>>>>> miss in the example above, but packet Y hits, it can go on the wire
>>>>> immediately. Which also means it can be completed first.  The driver
>>>>> will then receive a completion for packet Y before packet X.  The driver
>>>>> will stash the buffers for packet X in a hash table to allow the tx send
>>>>> queue descriptors for both packet X and Y to be reused. The driver will
>>>>> receive the completion for packet X sometime later and have to search
>>>>> the hash table for the associated packet.
>>>>>
>>>>> The driver cleans packets directly on the ring first, i.e. not out of
>>>>> order completions since they are to some extent considered "slow(er)
>>>>> path". However, certain workloads can increase the frequency of out of
>>>>> order completions thus introducing even more latency into the cleaning
>>>>> path. Bump up the timeout value to account for these workloads.
>>>>>
>>>>> Fixes: 0fe45467a104 ("idpf: add create vport and netdev configuration")
>>>>> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
>>>>> ---
>>>>>    drivers/net/ethernet/intel/idpf/idpf_lib.c | 4 ++--
>>>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>>
>>>> We tested this patch with our intensive high-performance workloads that
>>>> have been able to reproduce the issue, and it was sufficient to avoid tx
>>>> timeouts. We also noticed that these longer timeouts are not unusual in
>>>> the smartnic space: we see 100s or 50s timeouts for a few NICs, and
>>>
>>> Example?
> 
> a sample:
> 
> drivers/net/ethernet/cavium/thunder/nic.h:#define NICVF_TX_TIMEOUT (50 * HZ)
> drivers/net/ethernet/cavium/thunder/nicvf_main.c: netdev->watchdog_timeo = NICVF_TX_TIMEOUT;
> drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.h:#define OTX2_TX_TIMEOUT (100 * HZ)
> drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c: netdev->watchdog_timeo = OTX2_TX_TIMEOUT;
> drivers/net/ethernet/marvell/octeontx2/nic/otx2_vf.c: netdev->watchdog_timeo = OTX2_TX_TIMEOUT;
> drivers/net/ethernet/amazon/ena/ena_netdev.c: netdev->watchdog_timeo = msecs_to_jiffies(hints->netdev_wd_timeout);

This one doesn't say anything at all :D

mlx5 has 15 seconds, but mlx4 has the same TO as well, so this might be
some legacy stuff.
Netronome has 5, QLogic has 5, Broadcom 5 etc etc.
These 50-100 belong to one vendor (Cavium is Marvell) and look like a
hack to hide HW issues.

Re "some issues were observed" -- this patch only hides the symptoms, at
least from what I'm seeing currently. There are still no details that
would let me understand the reasons for it.

Thanks,
Olek

Alexander Lobakin June 13, 2024, 10:16 a.m. UTC | #10

From: Josh Hay <joshua.a.hay@intel.com>
Date: Wed, 12 Jun 2024 23:36:29 -0700

> 
> 
> On 6/12/2024 2:34 AM, Alexander Lobakin wrote:
>> From: Josh Hay <joshua.a.hay@intel.com>
>> Date: Tue, 11 Jun 2024 11:13:53 -0700
>>
>>>
>>>
>>> On 6/11/2024 3:44 AM, Alexander Lobakin wrote:
>>>> From: David Decotigny <ddecotig@gmail.com>
>>>> Date: Tue, 4 Jun 2024 16:34:48 -0700
>>
>> [...]
>>
>>>> Note that there are several patches fixing Tx (incl. timeouts) in my
>>>> tree, including yours (Joshua's) which you somehow didn't send yet ._.
>>>> Maybe start from them first?
>>>
>>> I believe it was you who specifically asked our team to defer pushing
>>> any upstream patches while you were working on the XDP series "to avoid
>>> having to rebase", which was a reasonable request at the time. We also
>>
>> It was only related to the virtchnl refactoring and later I cancelled
>> that when I realized it will go earlier than our series.
>>
>>> had no reason to believe the existing upstream idpf implementation was
>>> experiencing timeouts (it is being tested by numerous validation teams).
>>> So there was no urgency to get those patches upstream. Which patches in
>>> your tree do you believe fix specific timeout situations? It appears you
>>
>> [0][1][2]
>>
>>> pulled in some of the changes from the out-of-tree driver, but those
>>> were all enhancements. It wasn't until the workload that David mentioned
>>
>> No, these are all fixes.
>>
>> [0] is yours from the OOT, extended.
>> [1] is mine and never was in the OOT.
>> [2] is yours from the OOT, extended by Michał.
> 
> My main point was since no other tx timeouts have been reported on the
> upstream driver, none of these seem like critical fixes. This particular

Lots of bugs have never been reported, this doesn't mean they don't exist.

> timeout signature did not seem to match any of these in general. E.g. it
> would have been obvious if the timeout was because of what [0] fixes.
> It's also possible these timeouts did not manifest on the upstream
> driver because it is missing other OOT changes.

What I'm asking is: why not first try to reproduce the issues that this
patch tries to hide on my tree with [0][1][2] and see whether they're
still there? If those patches fix the issue, then why extend the timeout?

> 
>>
>> They really do help.
> 
> No disagreement there. I would've loved to push these changes sooner,
> but we already covered why that didn't happen.

Because "they're not critical"? Intel's been having Tx timeouts on idpf
for years. I'd say that at least [1] is critical since it fixes an
obvious bug. Without [2], XDP just doesn't work on idpf.
Why isn't [0] critical, given that without it you stash partial
packets and, in some cases, ntc/ntu can go beyond the ring size? Is that
okay?

> 
>>
>> Note that there's one more Tx timeout patch from you in the OOT, but it
>> actually broke Tx xD
> 
> If you are implying that our team would commit code that is knowingly
> broken, that is absolutely not true. I believe what you're referring to

I didn't say that. I just said that I tried to pull your latest (yet
another) Tx timeout fix and with it, I have more timeouts than without.

> is a change that introduced a tx timeout that also took a very specific
> workload to trigger it. That issue was fixed and not applicable to the
> current upstream implementation, so I do not see how that is relevant to
> this conversation.

It was just a side note, don't focus on it too much instead of the
really important stuff.

> 
>>
>>> was run on the current driver that we had any indication there were
>>> timeout issues.
>>>
>>>>
>>>> I don't buy 30 seconds, at least for now. Maybe I'm missing something.
>>>>
>>>> Nacked-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>>>
>>>
>>> In the process of debugging the newly discovered timeout, our
>>> architecture team made it clear that 5 seconds is too low for this type
>>> of device, with a non deterministic pipeline where packets can take a
>>> number of exception/slow paths. Admittedly, we don't know the exact
>>
>> Slowpath which takes 30 seconds to complete, seriously?
> 
> The architecture team said 5s is too low. 30s was chosen to give ample
> cushion to avoid changing the timeout should this situation come up again.

Just "said", without any details why? Now I can say that 5 is too much
and we need to have 1 second there. Why believe them, not me, w/o any
arguments/explanation?

(I at least have an argument that usually packets get sent in a couple
 us or faster and even 1 second between triggering the sending and
 receiving a completion is too much and if that happens, there were
 some HW issues)

> 
>>
>>> number, so the solution for the time being was to bump it up with a
>>> comfortable buffer. As we tune things and debug with various workloads,
>>> we can bring it back down. As David mentioned, there is precedent for an
>>> extended timeout for smartnics. Why is it suddenly unacceptable for
>>> Intel's device?
>>
>> I don't know where this "suddenly" comes from.
>> Because even 5 seconds is too much.
>> HW usually sends packets in microseconds if not faster. Extending the
>> timeout will hide real issues and make debugging more difficult.
> 
> Can you please elaborate on exactly why it would be more difficult? If
> something is actually wrong in HW, it seems unlikely extra time would
> correct it. If something is functionally wrong in the driver, why does
> it matter if it's 5s, 15s, or 30s? It will timeout just the same.

HW can hang and perform a reset, and if you have 30 seconds for the Tx
timeout, you might not even notice it. Having 5 seconds or lower for the
timeout will most likely spot it.
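
To make that trade-off concrete, here is a minimal user-space sketch of
the kind of check the Tx watchdog performs. It is a simplified model,
not the kernel code: the struct txq_model type, the
watchdog_would_fire() helper and the HZ value of 1000 are assumptions
made for illustration. Only the wraparound-safe comparison mirrors the
kernel's time_after() macro, and the model assumes that only a stopped
queue can be flagged.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed tick rate for this model; the kernel's HZ is a build option. */
#define HZ 1000UL

/* Wraparound-safe "a is later than b", in the style of time_after(). */
static bool time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Hypothetical stand-in for the per-queue state the watchdog looks at. */
struct txq_model {
	bool stopped;              /* queue was stopped by the driver */
	unsigned long trans_start; /* tick of the last transmit start */
};

/*
 * Returns true when a stopped queue has made no progress for longer
 * than watchdog_timeo ticks. A queue that is not stopped never fires.
 */
static bool watchdog_would_fire(const struct txq_model *q,
				unsigned long now,
				unsigned long watchdog_timeo)
{
	assert(watchdog_timeo > 0);

	return q->stopped &&
	       time_after(now, q->trans_start + watchdog_timeo);
}
```

With a queue that has been stalled for 10 seconds,
watchdog_would_fire(&q, now, 5 * HZ) reports a timeout while
watchdog_would_fire(&q, now, 30 * HZ) does not, so in this model a hang
followed by a reset that recovers within 30 seconds goes unreported.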

> 
>>
>> I suspect this all is for OOO packets with an explicit sending timestamp
>> passed from the userspace, but as I said, you need to teach the kernel
>> watchdog to account them.
> 
> Out of order completions can happen for numerous reasons, some of which
> the driver will know nothing about, i.e. the userspace timestamps are
> not the only things that trigger OOO completions. We can detect that
> we're processing completion B before A, but it's only at that time that
> we can tell the stack to _maybe_ expect a delayed completion. I'm open
> to discussing this further, but it does not seem like a simple solution
> that can be implemented in the immediate future.

I still didn't get an explanation of why a packet can receive an OOO
completion after a whole 30 seconds instead of a couple of microseconds.
I also don't understand why we can receive an OOO completion for packets
that were sent without an explicit timestamp from the userspace. Without
the timestamp (or some sort of priority etc.), they should be sent in
the same order as they were passed to the driver, right? If so, why can
a packet get an OOO completion, which means it was not sent in the same
order as the driver got it?
Does HW do some prioritizing even if the kernel didn't ask for it? But
even if so, HW shouldn't in general delay sending for 30 seconds?

> 
>> Otherwise, I can ask the driver to send a packet in 31 seconds, what
>> then, there will be a timeout and you will send a patch to extend it to
>> 60 seconds?
> 
> I hope there are checks in the stack itself that would not allow the
> packet to be scheduled beyond the timeout window :)

I don't think so, but since I haven't read the actual code, I won't say
anything definite. Anyway, it was just a side note; my main concerns
are above.

Thanks,
Olek
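
For readers following the thread, the out-of-order completion handling
described in the quoted commit message (a completion for packet Y
arrives before the one for packet X, so X's buffers are stashed and both
descriptors become reusable) can be sketched as below. This is a toy
user-space model under stated assumptions, not the idpf implementation:
all names (struct tx_model, tx_send(), tx_complete(), the TX_* results)
are hypothetical, packets are plain integer ids, and a small linear
array stands in for the driver's hash table.

```c
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE  8 /* hypothetical ring size */
#define STASH_SIZE 8 /* hypothetical stash capacity */

enum tx_result { TX_DONE_RING, TX_DONE_OOO, TX_DONE_STASH, TX_UNKNOWN, TX_ERR };

struct tx_model {
	int ring[RING_SIZE];   /* packet ids in send order */
	size_t head, count;    /* oldest pending slot, number pending */
	int stash[STASH_SIZE]; /* ids moved off the ring, still incomplete */
	size_t stashed;
};

/* Queue a packet id on the ring; fails when the ring is full. */
static bool tx_send(struct tx_model *m, int id)
{
	if (m->count == RING_SIZE)
		return false;
	m->ring[(m->head + m->count) % RING_SIZE] = id;
	m->count++;
	return true;
}

/* Handle a completion for id, looking on the ring first, then the stash. */
static enum tx_result tx_complete(struct tx_model *m, int id)
{
	size_t i, j;

	for (i = 0; i < m->count; i++) {
		if (m->ring[(m->head + i) % RING_SIZE] != id)
			continue;
		/* i older packets are still incomplete: stash them. */
		if (m->stashed + i > STASH_SIZE)
			return TX_ERR;
		for (j = 0; j < i; j++)
			m->stash[m->stashed++] =
				m->ring[(m->head + j) % RING_SIZE];
		/* Free the stashed slots and the completed one for reuse. */
		m->head = (m->head + i + 1) % RING_SIZE;
		m->count -= i + 1;
		return i ? TX_DONE_OOO : TX_DONE_RING;
	}

	/* Slow path: a late completion for a previously stashed packet. */
	for (i = 0; i < m->stashed; i++) {
		if (m->stash[i] == id) {
			m->stash[i] = m->stash[--m->stashed];
			return TX_DONE_STASH;
		}
	}
	return TX_UNKNOWN;
}
```

Sending X then Y and completing Y first returns TX_DONE_OOO and leaves X
in the stash; the later completion for X returns TX_DONE_STASH. The
stash lookup is the "slow(er) path" that the commit message says becomes
more frequent under certain workloads.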
