diff mbox series

[net] net: ethernet: ti: am65-cpsw: fix freeing IRQ in am65_cpsw_nuss_remove_tx_chns()

Message ID 20250114-am65-cpsw-fix-tx-irq-free-v1-1-b2069e6ed185@kernel.org (mailing list archive)
State New
Delegated to: Netdev Maintainers
Headers show
Series [net] net: ethernet: ti: am65-cpsw: fix freeing IRQ in am65_cpsw_nuss_remove_tx_chns() | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 1 maintainers not CCed: jpanis@baylibre.com
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 8 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2025-01-14--18-00 (tests: 884)

Commit Message

Roger Quadros Jan. 14, 2025, 4:44 p.m. UTC
When getting the IRQ we use k3_udma_glue_rx_get_irq() which returns
negative error value on error. So not NULL check is not sufficient
to deteremine if IRQ is valid. Check that IRQ is greater then zero
to ensure it is valid.

Fixes: 93a76530316a ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver")
Signed-off-by: Roger Quadros <rogerq@kernel.org>
---
 drivers/net/ethernet/ti/am65-cpsw-nuss.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


---
base-commit: 5bc55a333a2f7316b58edc7573e8e893f7acb532
change-id: 20250114-am65-cpsw-fix-tx-irq-free-846ac55ee6e1

Best regards,

Comments

Simon Horman Jan. 14, 2025, 5:12 p.m. UTC | #1
On Tue, Jan 14, 2025 at 06:44:02PM +0200, Roger Quadros wrote:
> When getting the IRQ we use k3_udma_glue_rx_get_irq() which returns
> negative error value on error. So not NULL check is not sufficient
> to deteremine if IRQ is valid. Check that IRQ is greater then zero
> to ensure it is valid.
> 
> Fixes: 93a76530316a ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver")
> Signed-off-by: Roger Quadros <rogerq@kernel.org>

Reviewed-by: Simon Horman <horms@kernel.org>
Siddharth Vadapalli Jan. 15, 2025, 5:18 a.m. UTC | #2
On Tue, Jan 14, 2025 at 06:44:02PM +0200, Roger Quadros wrote:

Hello Roger,

> When getting the IRQ we use k3_udma_glue_rx_get_irq() which returns

You probably meant "k3_udma_glue_tx_get_irq()" instead? It is used to
assign tx_chn->irq within am65_cpsw_nuss_init_tx_chns() as follows:

		tx_chn->irq = k3_udma_glue_tx_get_irq(tx_chn->tx_chn);

Additionally, following the above section we have:

		if (tx_chn->irq < 0) {
			dev_err(dev, "Failed to get tx dma irq %d\n",
				tx_chn->irq);
			ret = tx_chn->irq;
			goto err;
		}

Could you please provide details on the code-path which will lead to a
negative "tx_chn->irq" within "am65_cpsw_nuss_remove_tx_chns()"?

There seem to be two callers of am65_cpsw_nuss_remove_tx_chns(), namely:
1. am65_cpsw_nuss_update_tx_rx_chns()
2. am65_cpsw_nuss_suspend()
Since both of them seem to invoke am65_cpsw_nuss_remove_tx_chns() only
in the case where am65_cpsw_nuss_init_tx_chns() *did not* error out, it
appears to me that "tx_chn->irq" will never be negative within
am65_cpsw_nuss_remove_tx_chns()

Please let me know if I have overlooked something.

Regards,
Siddharth.

> negative error value on error. So not NULL check is not sufficient
> to deteremine if IRQ is valid. Check that IRQ is greater then zero
> to ensure it is valid.
> 
> Fixes: 93a76530316a ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver")
> Signed-off-by: Roger Quadros <rogerq@kernel.org>
> ---
>  drivers/net/ethernet/ti/am65-cpsw-nuss.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.c b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
> index 5465bf872734..e1de45fb18ae 100644
> --- a/drivers/net/ethernet/ti/am65-cpsw-nuss.c
> +++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
> @@ -2248,7 +2248,7 @@ static void am65_cpsw_nuss_remove_tx_chns(struct am65_cpsw_common *common)
>  	for (i = 0; i < common->tx_ch_num; i++) {
>  		struct am65_cpsw_tx_chn *tx_chn = &common->tx_chns[i];
>  
> -		if (tx_chn->irq)
> +		if (tx_chn->irq > 0)
>  			devm_free_irq(dev, tx_chn->irq, tx_chn);
>  
>  		netif_napi_del(&tx_chn->napi_tx);
> 
> ---
> base-commit: 5bc55a333a2f7316b58edc7573e8e893f7acb532
> change-id: 20250114-am65-cpsw-fix-tx-irq-free-846ac55ee6e1
> 
> Best regards,
> -- 
> Roger Quadros <rogerq@kernel.org>
>
Roger Quadros Jan. 15, 2025, 10:04 a.m. UTC | #3
Hi Siddharth,

On 15/01/2025 07:18, Siddharth Vadapalli wrote:
> On Tue, Jan 14, 2025 at 06:44:02PM +0200, Roger Quadros wrote:
> 
> Hello Roger,
> 
>> When getting the IRQ we use k3_udma_glue_rx_get_irq() which returns
> 
> You probably meant "k3_udma_glue_tx_get_irq()" instead? It is used to
> assign tx_chn->irq within am65_cpsw_nuss_init_tx_chns() as follows:

Yes I meant tx instead of rx.

> 
> 		tx_chn->irq = k3_udma_glue_tx_get_irq(tx_chn->tx_chn);
> 
> Additionally, following the above section we have:
> 
> 		if (tx_chn->irq < 0) {
> 			dev_err(dev, "Failed to get tx dma irq %d\n",
> 				tx_chn->irq);
> 			ret = tx_chn->irq;
> 			goto err;
> 		}
> 
> Could you please provide details on the code-path which will lead to a
> negative "tx_chn->irq" within "am65_cpsw_nuss_remove_tx_chns()"?
> 
> There seem to be two callers of am65_cpsw_nuss_remove_tx_chns(), namely:
> 1. am65_cpsw_nuss_update_tx_rx_chns()
> 2. am65_cpsw_nuss_suspend()
> Since both of them seem to invoke am65_cpsw_nuss_remove_tx_chns() only
> in the case where am65_cpsw_nuss_init_tx_chns() *did not* error out, it
> appears to me that "tx_chn->irq" will never be negative within
> am65_cpsw_nuss_remove_tx_chns()
> 
> Please let me know if I have overlooked something.

The issue is with am65_cpsw_nuss_update_tx_rx_chns(). It can be called
repeatedly (by user changing number of TX queues) even if previous call
to am65_cpsw_nuss_init_tx_chns() failed.

Please try the below patch to simulate the error condition.

Then do the following
- bring down all network interfaces
- change num TX queues to 2. IRQ for 2nd channel fails.
- change num TX queues to 3. Now we try to free an invalid IRQ and we see warning.

Also I think it is good practice to code for set value than to code
for existing code paths as they can change in the future.

--test patch starts--

diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.c b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
index 36c29d3db329..c22cadaaf3d3 100644
--- a/drivers/net/ethernet/ti/am65-cpsw-nuss.c
+++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
@@ -155,7 +155,7 @@
 			 NETIF_MSG_IFUP	| NETIF_MSG_PROBE | NETIF_MSG_IFDOWN | \
 			 NETIF_MSG_RX_ERR | NETIF_MSG_TX_ERR)
 
-#define AM65_CPSW_DEFAULT_TX_CHNS	8
+#define AM65_CPSW_DEFAULT_TX_CHNS	1
 #define AM65_CPSW_DEFAULT_RX_CHN_FLOWS	1
 
 /* CPPI streaming packet interface */
@@ -2346,7 +2348,10 @@ static int am65_cpsw_nuss_init_tx_chns(struct am65_cpsw_common *common)
 		tx_chn->dsize_log2 = __fls(hdesc_size_out);
 		WARN_ON(hdesc_size_out != (1 << tx_chn->dsize_log2));
 
-		tx_chn->irq = k3_udma_glue_tx_get_irq(tx_chn->tx_chn);
+		if (i == 1)
+			tx_chn->irq = -ENODEV;
+		else
+			tx_chn->irq = k3_udma_glue_tx_get_irq(tx_chn->tx_chn);
 		if (tx_chn->irq < 0) {
 			dev_err(dev, "Failed to get tx dma irq %d\n",
 				tx_chn->irq);

--
cheers,
-roger
Siddharth Vadapalli Jan. 15, 2025, 10:38 a.m. UTC | #4
On Wed, Jan 15, 2025 at 12:04:17PM +0200, Roger Quadros wrote:
> Hi Siddharth,
> 
> On 15/01/2025 07:18, Siddharth Vadapalli wrote:
> > On Tue, Jan 14, 2025 at 06:44:02PM +0200, Roger Quadros wrote:
> > 
> > Hello Roger,
> > 
> >> When getting the IRQ we use k3_udma_glue_rx_get_irq() which returns
> > 
> > You probably meant "k3_udma_glue_tx_get_irq()" instead? It is used to
> > assign tx_chn->irq within am65_cpsw_nuss_init_tx_chns() as follows:
> 
> Yes I meant tx instead of rx.
> 
> > 
> > 		tx_chn->irq = k3_udma_glue_tx_get_irq(tx_chn->tx_chn);
> > 
> > Additionally, following the above section we have:
> > 
> > 		if (tx_chn->irq < 0) {
> > 			dev_err(dev, "Failed to get tx dma irq %d\n",
> > 				tx_chn->irq);
> > 			ret = tx_chn->irq;
> > 			goto err;
> > 		}
> > 
> > Could you please provide details on the code-path which will lead to a
> > negative "tx_chn->irq" within "am65_cpsw_nuss_remove_tx_chns()"?
> > 
> > There seem to be two callers of am65_cpsw_nuss_remove_tx_chns(), namely:
> > 1. am65_cpsw_nuss_update_tx_rx_chns()
> > 2. am65_cpsw_nuss_suspend()
> > Since both of them seem to invoke am65_cpsw_nuss_remove_tx_chns() only
> > in the case where am65_cpsw_nuss_init_tx_chns() *did not* error out, it
> > appears to me that "tx_chn->irq" will never be negative within
> > am65_cpsw_nuss_remove_tx_chns()
> > 
> > Please let me know if I have overlooked something.
> 
> The issue is with am65_cpsw_nuss_update_tx_rx_chns(). It can be called
> repeatedly (by user changing number of TX queues) even if previous call
> to am65_cpsw_nuss_init_tx_chns() failed.

Thank you for clarifying. So the issue/bug was discovered since the
implementation of am65_cpsw_nuss_update_tx_rx_chns(). The "Fixes" tag
misled me. Maybe the "Fixes" tag should be updated? Though we should
code to future-proof it as done in this patch, the "Fixes" tag pointing
to the very first commit of the driver might not be accurate as the
code-path associated with the bug cannot be exercised at that commit.

Independent of the above change suggested for the "Fixes" tag,

Reviewed-by: Siddharth Vadapalli <s-vadapalli@ti.com>

There seems to be a different bug in am65_cpsw_nuss_update_tx_rx_chns()
which I have described below.

> 
> Please try the below patch to simulate the error condition.
> 
> Then do the following
> - bring down all network interfaces
> - change num TX queues to 2. IRQ for 2nd channel fails.
> - change num TX queues to 3. Now we try to free an invalid IRQ and we see warning.
> 
> Also I think it is good practice to code for set value than to code
> for existing code paths as they can change in the future.
> 
> --test patch starts--
> 
> diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.c b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
> index 36c29d3db329..c22cadaaf3d3 100644
> --- a/drivers/net/ethernet/ti/am65-cpsw-nuss.c
> +++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
> @@ -155,7 +155,7 @@
>  			 NETIF_MSG_IFUP	| NETIF_MSG_PROBE | NETIF_MSG_IFDOWN | \
>  			 NETIF_MSG_RX_ERR | NETIF_MSG_TX_ERR)
>  
> -#define AM65_CPSW_DEFAULT_TX_CHNS	8
> +#define AM65_CPSW_DEFAULT_TX_CHNS	1
>  #define AM65_CPSW_DEFAULT_RX_CHN_FLOWS	1
>  
>  /* CPPI streaming packet interface */
> @@ -2346,7 +2348,10 @@ static int am65_cpsw_nuss_init_tx_chns(struct am65_cpsw_common *common)
>  		tx_chn->dsize_log2 = __fls(hdesc_size_out);
>  		WARN_ON(hdesc_size_out != (1 << tx_chn->dsize_log2));
>  
> -		tx_chn->irq = k3_udma_glue_tx_get_irq(tx_chn->tx_chn);
> +		if (i == 1)
> +			tx_chn->irq = -ENODEV;
> +		else
> +			tx_chn->irq = k3_udma_glue_tx_get_irq(tx_chn->tx_chn);

The pair - am65_cpsw_nuss_init_tx_chns()/am65_cpsw_nuss_remove_tx_chns()
seem to be written under the assumption that failure will result in the
driver's probe failing.

With am65_cpsw_nuss_update_tx_rx_chns(), that assumption no longer holds
true. Please consider the following sequence:

1.
am65_cpsw_nuss_probe()
  am65_cpsw_nuss_register_ndevs()
    am65_cpsw_nuss_init_tx_chns() => Succeeds

2.
Probe is successful

3.
am65_cpsw_nuss_update_tx_rx_chns() => Invoked by user
  am65_cpsw_nuss_remove_tx_chns() => Succeeds
    am65_cpsw_nuss_init_tx_chns() => Partially fails
      devm_add_action(dev, am65_cpsw_nuss_free_tx_chns, common);
      ^ DEVM Action is added, but since the driver isn't removed,
      the cleanup via am65_cpsw_nuss_free_tx_chns() will not run.

Only when the user re-invokes am65_cpsw_nuss_update_tx_rx_chns(),
the cleanup will be performed. This might have to be fixed in the
following manner:

@@ -3416,10 +3416,17 @@ int am65_cpsw_nuss_update_tx_rx_chns(struct am65_cpsw_common *common,
        common->tx_ch_num = num_tx;
        common->rx_ch_num_flows = num_rx;
        ret = am65_cpsw_nuss_init_tx_chns(common);
-       if (ret)
+       if (ret) {
+               devm_remove_action(dev, am65_cpsw_nuss_free_tx_chns, common);
+               am65_cpsw_nuss_free_tx_chns(common);
                return ret;
+       }

        ret = am65_cpsw_nuss_init_rx_chns(common);
+       if (ret) {
+               devm_remove_action(dev, am65_cpsw_nuss_free_rx_chns, common);
+               am65_cpsw_nuss_free_rx_chns(common);
+       }

        return ret;
 }

 Please let me know what you think.


Regards,
Siddharth.
Roger Quadros Jan. 15, 2025, 3:49 p.m. UTC | #5
Hi Siddharth,

On 15/01/2025 12:38, Siddharth Vadapalli wrote:
> On Wed, Jan 15, 2025 at 12:04:17PM +0200, Roger Quadros wrote:
>> Hi Siddharth,
>>
>> On 15/01/2025 07:18, Siddharth Vadapalli wrote:
>>> On Tue, Jan 14, 2025 at 06:44:02PM +0200, Roger Quadros wrote:
>>>
>>> Hello Roger,
>>>
>>>> When getting the IRQ we use k3_udma_glue_rx_get_irq() which returns
>>>
>>> You probably meant "k3_udma_glue_tx_get_irq()" instead? It is used to
>>> assign tx_chn->irq within am65_cpsw_nuss_init_tx_chns() as follows:
>>
>> Yes I meant tx instead of rx.
>>
>>>
>>> 		tx_chn->irq = k3_udma_glue_tx_get_irq(tx_chn->tx_chn);
>>>
>>> Additionally, following the above section we have:
>>>
>>> 		if (tx_chn->irq < 0) {
>>> 			dev_err(dev, "Failed to get tx dma irq %d\n",
>>> 				tx_chn->irq);
>>> 			ret = tx_chn->irq;
>>> 			goto err;
>>> 		}
>>>
>>> Could you please provide details on the code-path which will lead to a
>>> negative "tx_chn->irq" within "am65_cpsw_nuss_remove_tx_chns()"?
>>>
>>> There seem to be two callers of am65_cpsw_nuss_remove_tx_chns(), namely:
>>> 1. am65_cpsw_nuss_update_tx_rx_chns()
>>> 2. am65_cpsw_nuss_suspend()
>>> Since both of them seem to invoke am65_cpsw_nuss_remove_tx_chns() only
>>> in the case where am65_cpsw_nuss_init_tx_chns() *did not* error out, it
>>> appears to me that "tx_chn->irq" will never be negative within
>>> am65_cpsw_nuss_remove_tx_chns()
>>>
>>> Please let me know if I have overlooked something.
>>
>> The issue is with am65_cpsw_nuss_update_tx_rx_chns(). It can be called
>> repeatedly (by user changing number of TX queues) even if previous call
>> to am65_cpsw_nuss_init_tx_chns() failed.
> 
> Thank you for clarifying. So the issue/bug was discovered since the
> implementation of am65_cpsw_nuss_update_tx_rx_chns(). The "Fixes" tag
> misled me. Maybe the "Fixes" tag should be updated? Though we should
> code to future-proof it as done in this patch, the "Fixes" tag pointing
> to the very first commit of the driver might not be accurate as the
> code-path associated with the bug cannot be exercised at that commit.

Fair enough. I'll change the Fixes commit.

> 
> Independent of the above change suggested for the "Fixes" tag,
> 
> Reviewed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
> 
> There seems to be a different bug in am65_cpsw_nuss_update_tx_rx_chns()
> which I have described below.
> 
>>
>> Please try the below patch to simulate the error condition.
>>
>> Then do the following
>> - bring down all network interfaces
>> - change num TX queues to 2. IRQ for 2nd channel fails.
>> - change num TX queues to 3. Now we try to free an invalid IRQ and we see warning.
>>
>> Also I think it is good practice to code for set value than to code
>> for existing code paths as they can change in the future.
>>
>> --test patch starts--
>>
>> diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.c b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
>> index 36c29d3db329..c22cadaaf3d3 100644
>> --- a/drivers/net/ethernet/ti/am65-cpsw-nuss.c
>> +++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
>> @@ -155,7 +155,7 @@
>>  			 NETIF_MSG_IFUP	| NETIF_MSG_PROBE | NETIF_MSG_IFDOWN | \
>>  			 NETIF_MSG_RX_ERR | NETIF_MSG_TX_ERR)
>>  
>> -#define AM65_CPSW_DEFAULT_TX_CHNS	8
>> +#define AM65_CPSW_DEFAULT_TX_CHNS	1
>>  #define AM65_CPSW_DEFAULT_RX_CHN_FLOWS	1
>>  
>>  /* CPPI streaming packet interface */
>> @@ -2346,7 +2348,10 @@ static int am65_cpsw_nuss_init_tx_chns(struct am65_cpsw_common *common)
>>  		tx_chn->dsize_log2 = __fls(hdesc_size_out);
>>  		WARN_ON(hdesc_size_out != (1 << tx_chn->dsize_log2));
>>  
>> -		tx_chn->irq = k3_udma_glue_tx_get_irq(tx_chn->tx_chn);
>> +		if (i == 1)
>> +			tx_chn->irq = -ENODEV;
>> +		else
>> +			tx_chn->irq = k3_udma_glue_tx_get_irq(tx_chn->tx_chn);
> 
> The pair - am65_cpsw_nuss_init_tx_chns()/am65_cpsw_nuss_remove_tx_chns()
> seem to be written under the assumption that failure will result in the
> driver's probe failing.
> 
> With am65_cpsw_nuss_update_tx_rx_chns(), that assumption no longer holds
> true. Please consider the following sequence:
> 
> 1.
> am65_cpsw_nuss_probe()
>   am65_cpsw_nuss_register_ndevs()
>     am65_cpsw_nuss_init_tx_chns() => Succeeds
> 
> 2.
> Probe is successful
> 
> 3.
> am65_cpsw_nuss_update_tx_rx_chns() => Invoked by user
>   am65_cpsw_nuss_remove_tx_chns() => Succeeds
>     am65_cpsw_nuss_init_tx_chns() => Partially fails
>       devm_add_action(dev, am65_cpsw_nuss_free_tx_chns, common);
>       ^ DEVM Action is added, but since the driver isn't removed,
>       the cleanup via am65_cpsw_nuss_free_tx_chns() will not run.
> 
> Only when the user re-invokes am65_cpsw_nuss_update_tx_rx_chns(),
> the cleanup will be performed. This might have to be fixed in the
> following manner:
> 
> @@ -3416,10 +3416,17 @@ int am65_cpsw_nuss_update_tx_rx_chns(struct am65_cpsw_common *common,
>         common->tx_ch_num = num_tx;
>         common->rx_ch_num_flows = num_rx;
>         ret = am65_cpsw_nuss_init_tx_chns(common);
> -       if (ret)
> +       if (ret) {
> +               devm_remove_action(dev, am65_cpsw_nuss_free_tx_chns, common);
> +               am65_cpsw_nuss_free_tx_chns(common);
>                 return ret;
> +       }
> 
>         ret = am65_cpsw_nuss_init_rx_chns(common);
> +       if (ret) {
> +               devm_remove_action(dev, am65_cpsw_nuss_free_rx_chns, common);
> +               am65_cpsw_nuss_free_rx_chns(common);
> +       }
> 
>         return ret;
>  }
> 
>  Please let me know what you think.

I've already implemented a cleanup series to get rid of devm_add/remove_action,
cleanup probe error path and streamline TX and RQ queue init/cleanup.
I'll send out the series soon as soon as I finish some tests.
diff mbox series

Patch

diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.c b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
index 5465bf872734..e1de45fb18ae 100644
--- a/drivers/net/ethernet/ti/am65-cpsw-nuss.c
+++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
@@ -2248,7 +2248,7 @@  static void am65_cpsw_nuss_remove_tx_chns(struct am65_cpsw_common *common)
 	for (i = 0; i < common->tx_ch_num; i++) {
 		struct am65_cpsw_tx_chn *tx_chn = &common->tx_chns[i];
 
-		if (tx_chn->irq)
+		if (tx_chn->irq > 0)
 			devm_free_irq(dev, tx_chn->irq, tx_chn);
 
 		netif_napi_del(&tx_chn->napi_tx);