diff mbox series

[net,2/2] pds_core: Add a retry mechanism when the adminq is full

Message ID 20250129004337.36898-3-shannon.nelson@amd.com (mailing list archive)
State New
Delegated to: Netdev Maintainers
Headers show
Series pds_core: fixes for adminq overflow | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/build_tools success Errors and warnings before: 26 (+1) this patch: 26 (+1)
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 18 this patch: 18
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning WARNING: msleep < 20ms can sleep for up to 20ms; see function description of msleep().
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2025-01-29--12-00 (tests: 885)

Commit Message

Shannon Nelson Jan. 29, 2025, 12:43 a.m. UTC
From: Brett Creeley <brett.creeley@amd.com>

If the adminq is full, the driver reports failure when trying to post
new adminq commands. This is a bit aggressive and unexpected because
technically the adminq post didn't fail in this case, it was just full.
To harden this path add support for a bounded retry mechanism.

It's possible some commands take longer than expected, maybe hundreds
of milliseconds or seconds due to other processing on the device side,
so to further reduce the chance of failure due to adminq full increase
the PDS_CORE_DEVCMD_TIMEOUT from 5 to 10 seconds.

The caller of pdsc_adminq_post() may still see -EAGAIN reported if the
space in the adminq never freed up. In this case they can choose to
call the function again or fail. For now, no callers will retry.

Fixes: 01ba61b55b20 ("pds_core: Add adminq processing and commands")
Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
---
 drivers/net/ethernet/amd/pds_core/adminq.c | 22 ++++++++++++++++++----
 include/linux/pds/pds_core_if.h            |  2 +-
 2 files changed, 19 insertions(+), 5 deletions(-)

Comments

Michal Swiatkowski Jan. 29, 2025, 8:08 a.m. UTC | #1
On Tue, Jan 28, 2025 at 04:43:37PM -0800, Shannon Nelson wrote:
> From: Brett Creeley <brett.creeley@amd.com>
> 
> If the adminq is full, the driver reports failure when trying to post
> new adminq commands. This is a bit aggressive and unexpected because
> technically the adminq post didn't fail in this case, it was just full.
> To harden this path add support for a bounded retry mechanism.
> 
> It's possible some commands take longer than expected, maybe hundreds
> of milliseconds or seconds due to other processing on the device side,
> so to further reduce the chance of failure due to adminq full increase
> the PDS_CORE_DEVCMD_TIMEOUT from 5 to 10 seconds.
> 
> The caller of pdsc_adminq_post() may still see -EAGAIN reported if the
> space in the adminq never freed up. In this case they can choose to
> call the function again or fail. For now, no callers will retry.
> 
> Fixes: 01ba61b55b20 ("pds_core: Add adminq processing and commands")
> Signed-off-by: Brett Creeley <brett.creeley@amd.com>
> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
> ---
>  drivers/net/ethernet/amd/pds_core/adminq.c | 22 ++++++++++++++++++----
>  include/linux/pds/pds_core_if.h            |  2 +-
>  2 files changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/amd/pds_core/adminq.c b/drivers/net/ethernet/amd/pds_core/adminq.c
> index c83a0a80d533..387de1712827 100644
> --- a/drivers/net/ethernet/amd/pds_core/adminq.c
> +++ b/drivers/net/ethernet/amd/pds_core/adminq.c
> @@ -181,7 +181,10 @@ static int __pdsc_adminq_post(struct pdsc *pdsc,
>  	else
>  		avail -= q->head_idx + 1;
>  	if (!avail) {
> -		ret = -ENOSPC;
> +		if (!pdsc_is_fw_running(pdsc))
> +			ret = -ENXIO;
> +		else
> +			ret = -EAGAIN;

Short if will fit nice here, anyway:
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>

>  		goto err_out_unlock;
>  	}
>  
> @@ -251,14 +254,25 @@ int pdsc_adminq_post(struct pdsc *pdsc,
>  	}
>  
>  	wc.qcq = &pdsc->adminqcq;
> -	index = __pdsc_adminq_post(pdsc, &pdsc->adminqcq, cmd, comp, &wc);
> +	time_start = jiffies;
> +	time_limit = time_start + HZ * pdsc->devcmd_timeout;
> +	do {
> +		index = __pdsc_adminq_post(pdsc, &pdsc->adminqcq, cmd, comp,
> +					   &wc);
> +		if (index != -EAGAIN)
> +			break;
> +
> +		dev_dbg(pdsc->dev, "Retrying adminq cmd opcode %u\n",
> +			cmd->opcode);
> +		/* Give completion processing a chance to free up space */
> +		msleep(1);
> +	} while (time_before(jiffies, time_limit));
> +
>  	if (index < 0) {
>  		err = index;
>  		goto err_out;
>  	}
>  
> -	time_start = jiffies;
> -	time_limit = time_start + HZ * pdsc->devcmd_timeout;
>  	do {
>  		/* Timeslice the actual wait to catch IO errors etc early */
>  		poll_jiffies = msecs_to_jiffies(poll_interval);
> diff --git a/include/linux/pds/pds_core_if.h b/include/linux/pds/pds_core_if.h
> index 17a87c1a55d7..babc6d573acd 100644
> --- a/include/linux/pds/pds_core_if.h
> +++ b/include/linux/pds/pds_core_if.h
> @@ -22,7 +22,7 @@
>  #define PDS_CORE_BAR0_INTR_CTRL_OFFSET		0x2000
>  #define PDS_CORE_DEV_CMD_DONE			0x00000001
>  
> -#define PDS_CORE_DEVCMD_TIMEOUT			5
> +#define PDS_CORE_DEVCMD_TIMEOUT			10
>  
>  #define PDS_CORE_CLIENT_ID			0
>  #define PDS_CORE_ASIC_TYPE_CAPRI		0
> -- 
> 2.17.1
Jakub Kicinski Jan. 30, 2025, 3:03 a.m. UTC | #2
On Tue, 28 Jan 2025 16:43:37 -0800 Shannon Nelson wrote:
> If the adminq is full, the driver reports failure when trying to post
> new adminq commands. This is a bit aggressive and unexpected because
> technically the adminq post didn't fail in this case, it was just full.
> To harden this path add support for a bounded retry mechanism.
> 
> It's possible some commands take longer than expected, maybe hundreds
> of milliseconds or seconds due to other processing on the device side,
> so to further reduce the chance of failure due to adminq full increase
> the PDS_CORE_DEVCMD_TIMEOUT from 5 to 10 seconds.
> 
> The caller of pdsc_adminq_post() may still see -EAGAIN reported if the
> space in the adminq never freed up. In this case they can choose to
> call the function again or fail. For now, no callers will retry.

How about a semaphore? You can initialize it to the number of slots
in the queue, and use down_timeout() if you want the 10 sec timeout?
diff mbox series

Patch

diff --git a/drivers/net/ethernet/amd/pds_core/adminq.c b/drivers/net/ethernet/amd/pds_core/adminq.c
index c83a0a80d533..387de1712827 100644
--- a/drivers/net/ethernet/amd/pds_core/adminq.c
+++ b/drivers/net/ethernet/amd/pds_core/adminq.c
@@ -181,7 +181,10 @@  static int __pdsc_adminq_post(struct pdsc *pdsc,
 	else
 		avail -= q->head_idx + 1;
 	if (!avail) {
-		ret = -ENOSPC;
+		if (!pdsc_is_fw_running(pdsc))
+			ret = -ENXIO;
+		else
+			ret = -EAGAIN;
 		goto err_out_unlock;
 	}
 
@@ -251,14 +254,25 @@  int pdsc_adminq_post(struct pdsc *pdsc,
 	}
 
 	wc.qcq = &pdsc->adminqcq;
-	index = __pdsc_adminq_post(pdsc, &pdsc->adminqcq, cmd, comp, &wc);
+	time_start = jiffies;
+	time_limit = time_start + HZ * pdsc->devcmd_timeout;
+	do {
+		index = __pdsc_adminq_post(pdsc, &pdsc->adminqcq, cmd, comp,
+					   &wc);
+		if (index != -EAGAIN)
+			break;
+
+		dev_dbg(pdsc->dev, "Retrying adminq cmd opcode %u\n",
+			cmd->opcode);
+		/* Give completion processing a chance to free up space */
+		msleep(1);
+	} while (time_before(jiffies, time_limit));
+
 	if (index < 0) {
 		err = index;
 		goto err_out;
 	}
 
-	time_start = jiffies;
-	time_limit = time_start + HZ * pdsc->devcmd_timeout;
 	do {
 		/* Timeslice the actual wait to catch IO errors etc early */
 		poll_jiffies = msecs_to_jiffies(poll_interval);
diff --git a/include/linux/pds/pds_core_if.h b/include/linux/pds/pds_core_if.h
index 17a87c1a55d7..babc6d573acd 100644
--- a/include/linux/pds/pds_core_if.h
+++ b/include/linux/pds/pds_core_if.h
@@ -22,7 +22,7 @@ 
 #define PDS_CORE_BAR0_INTR_CTRL_OFFSET		0x2000
 #define PDS_CORE_DEV_CMD_DONE			0x00000001
 
-#define PDS_CORE_DEVCMD_TIMEOUT			5
+#define PDS_CORE_DEVCMD_TIMEOUT			10
 
 #define PDS_CORE_CLIENT_ID			0
 #define PDS_CORE_ASIC_TYPE_CAPRI		0