diff mbox series

mlx5 driver is broken when pci_msix_can_alloc_dyn() is false with v6.4-rc4

Message ID d6ada86741a440f99b5d6fedff532c8dbe86254f.camel@linux.ibm.com (mailing list archive)
State Awaiting Upstream
Delegated to: Netdev Maintainers
Headers show
Series mlx5 driver is broken when pci_msix_can_alloc_dyn() is false with v6.4-rc4 | expand

Checks

Context Check Description
netdev/tree_selection success Guessing tree name failed - patch did not apply

Commit Message

Niklas Schnelle May 30, 2023, 1:04 p.m. UTC
Hi Saeed, Eli, Shay,

With v6.4-rc4 I'm getting a stream of RX and TX timeouts when trying to
use ConnectX-4 and ConnectX-6 VFs on s390. I've bisected this and found
the following commit to be broken:

commit 1da438c0ae02396dc5018b63237492cb5908608d
Author: Shay Drory <shayd@nvidia.com>
Date:   Mon Apr 17 10:57:50 2023 +0300

    net/mlx5: Fix indexing of mlx5_irq

    After the cited patch, mlx5_irq xarray index can be different then
    mlx5_irq MSIX table index.
    Fix it by storing both mlx5_irq xarray index and MSIX table index.

    Fixes: 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation")
    Signed-off-by: Shay Drory <shayd@nvidia.com>
    Reviewed-by: Eli Cohen <elic@nvidia.com>
    Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

The problem is that our IRQs currently still use a legacy mode instead
of a full fledged IRQ domain. One consequence of that is that
pci_msix_can_alloc_dyn(dev->pdev) returns false. That lands us in the
non dynamic case in mlx5_irq_alloc() where irq->map.index is set to 0.
Now prior to the above commit irq->map.index would later be set to i
(the irq number) but that was replaced with just setting irq-
>pool_index = i. For the dynamic case this is fine because
pci_msix_alloc_irq_at() sets it but for the non-dynamic case this leave
it unset. With the following diff the RX/TX timeouts go away:

MSI_ANY_INDEX, af_desc);
                if (!irq->map.virq) {



I'll sent a patch with the above shortly but wanted to give you a heads
up since I'd really like to get this fixed for -rc5 or at least -rc6.

Thanks,
Niklas

Comments

Thorsten Leemhuis May 31, 2023, 1:33 p.m. UTC | #1
[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few templates
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 30.05.23 15:04, Niklas Schnelle wrote:
> 
> With v6.4-rc4 I'm getting a stream of RX and TX timeouts when trying to
> use ConnectX-4 and ConnectX-6 VFs on s390. I've bisected this and found
> the following commit to be broken:
> 
> commit 1da438c0ae02396dc5018b63237492cb5908608d
> Author: Shay Drory <shayd@nvidia.com>
> Date:   Mon Apr 17 10:57:50 2023 +0300
> [...]

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced 1da438c0ae02396dc5018b63237492cb5908608d
#regzbot title net/mlx5: RX and TX timeouts with ConnectX-4 and
ConnectX-6 VFs on s390
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
Niklas Schnelle May 31, 2023, 1:43 p.m. UTC | #2
On Wed, 2023-05-31 at 15:33 +0200, Linux regression tracking #adding
(Thorsten Leemhuis) wrote:
> [CCing the regression list, as it should be in the loop for regressions:
> https://docs.kernel.org/admin-guide/reporting-regressions.html]
> 
> [TLDR: I'm adding this report to the list of tracked Linux kernel
> regressions; the text you find below is based on a few templates
> paragraphs you might have encountered already in similar form.
> See link in footer if these mails annoy you.]
> 
> On 30.05.23 15:04, Niklas Schnelle wrote:
> > 
> > With v6.4-rc4 I'm getting a stream of RX and TX timeouts when trying to
> > use ConnectX-4 and ConnectX-6 VFs on s390. I've bisected this and found
> > the following commit to be broken:
> > 
> > commit 1da438c0ae02396dc5018b63237492cb5908608d
> > Author: Shay Drory <shayd@nvidia.com>
> > Date:   Mon Apr 17 10:57:50 2023 +0300
> > [...]
> 
> Thanks for the report. To be sure the issue doesn't fall through the
> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
> tracking bot:
> 
> #regzbot ^introduced 1da438c0ae02396dc5018b63237492cb5908608d
> #regzbot title net/mlx5: RX and TX timeouts with ConnectX-4 and
> ConnectX-6 VFs on s390
> #regzbot ignore-activity
> 
> This isn't a regression? This issue or a fix for it are already
> discussed somewhere else? It was fixed already? You want to clarify when
> the regression started to happen? Or point out I got the title or
> something else totally wrong? Then just reply and tell me -- ideally
> while also telling regzbot about it, as explained by the page listed in
> the footer of this mail.
> 
> Developers: When fixing the issue, remember to add 'Link:' tags pointing
> to the report (the parent of this mail). See page linked in footer for
> details.
> 
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> That page also explains what to do if mails like this annoy you.

Hi Thorsten,

Thanks for tracking. I actually already sent a fix patch (and v2) for
this. Sadly I forgot to link to this mail. Let's see if I can get the
regzbot command right to update it. As for the humans the latest fix
patch is here:

https://lore.kernel.org/netdev/20230531084856.2091666-1-schnelle@linux.ibm.com/

Thanks,
Niklas

#regzbot fix: net/mlx5: Fix setting of irq->map.index for static IRQ case
Thorsten Leemhuis May 31, 2023, 1:57 p.m. UTC | #3
On 31.05.23 15:43, Niklas Schnelle wrote:
> On Wed, 2023-05-31 at 15:33 +0200, Linux regression tracking #adding
> (Thorsten Leemhuis) wrote:
>> [CCing the regression list, as it should be in the loop for regressions:
>> https://docs.kernel.org/admin-guide/reporting-regressions.html]
>>
>> [TLDR: I'm adding this report to the list of tracked Linux kernel
>> regressions; the text you find below is based on a few templates
>> paragraphs you might have encountered already in similar form.
>> See link in footer if these mails annoy you.]
>>
>> On 30.05.23 15:04, Niklas Schnelle wrote:
>>>
>>> With v6.4-rc4 I'm getting a stream of RX and TX timeouts when trying to
>>> use ConnectX-4 and ConnectX-6 VFs on s390. I've bisected this and found
>>> the following commit to be broken:
>>>
>>> commit 1da438c0ae02396dc5018b63237492cb5908608d
>>> Author: Shay Drory <shayd@nvidia.com>
>>> Date:   Mon Apr 17 10:57:50 2023 +0300
>>> [...]
>>
>> Thanks for the report. To be sure the issue doesn't fall through the
>> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
>> tracking bot:
>>
>> #regzbot ^introduced 1da438c0ae02396dc5018b63237492cb5908608d
>> #regzbot title net/mlx5: RX and TX timeouts with ConnectX-4 and
>> ConnectX-6 VFs on s390
>> #regzbot ignore-activity
>>
>> This isn't a regression? This issue or a fix for it are already
>> discussed somewhere else? It was fixed already? You want to clarify when
>> the regression started to happen? Or point out I got the title or
>> something else totally wrong? Then just reply and tell me -- ideally
>> while also telling regzbot about it, as explained by the page listed in
>> the footer of this mail.
>>
>> Developers: When fixing the issue, remember to add 'Link:' tags pointing
>> to the report (the parent of this mail). See page linked in footer for
>> details.
>>
>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>> --
>> Everything you wanna know about Linux kernel regression tracking:
>> https://linux-regtracking.leemhuis.info/about/#tldr
>> That page also explains what to do if mails like this annoy you.
> 
> Hi Thorsten,
> 
> Thanks for tracking. I actually already sent a fix patch (and v2) for
> this. Sadly I forgot to link to this mail. Let's see if I can get the
> regzbot command right to update it. As for the humans the latest fix
> patch is here:
> 
> https://lore.kernel.org/netdev/20230531084856.2091666-1-schnelle@linux.ibm.com/
> 
> Thanks,
> Niklas
> 
> #regzbot fix: net/mlx5: Fix setting of irq->map.index for static IRQ case

Looks right, many thx. Sorry, should have looked for that myself. Sadly
regzbot doesn't yet search for existing post on lore with a matching
subject, so for completeness let me point manually to it while at it:

#regzbot monitor:
https://lore.kernel.org/netdev/20230531084856.2091666-1-schnelle@linux.ibm.com/

Ciao, Thorsten
Saeed Mahameed May 31, 2023, 9:48 p.m. UTC | #4
On 31 May 15:57, Linux regression tracking (Thorsten Leemhuis) wrote:
>
>
>On 31.05.23 15:43, Niklas Schnelle wrote:
>> On Wed, 2023-05-31 at 15:33 +0200, Linux regression tracking #adding
>> (Thorsten Leemhuis) wrote:
>>> [CCing the regression list, as it should be in the loop for regressions:
>>> https://docs.kernel.org/admin-guide/reporting-regressions.html]
>>>
>>> [TLDR: I'm adding this report to the list of tracked Linux kernel
>>> regressions; the text you find below is based on a few templates
>>> paragraphs you might have encountered already in similar form.
>>> See link in footer if these mails annoy you.]
>>>
>>> On 30.05.23 15:04, Niklas Schnelle wrote:
>>>>
>>>> With v6.4-rc4 I'm getting a stream of RX and TX timeouts when trying to
>>>> use ConnectX-4 and ConnectX-6 VFs on s390. I've bisected this and found
>>>> the following commit to be broken:
>>>>
>>>> commit 1da438c0ae02396dc5018b63237492cb5908608d
>>>> Author: Shay Drory <shayd@nvidia.com>
>>>> Date:   Mon Apr 17 10:57:50 2023 +0300
>>>> [...]
>>>
>>> Thanks for the report. To be sure the issue doesn't fall through the
>>> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
>>> tracking bot:
>>>
>>> #regzbot ^introduced 1da438c0ae02396dc5018b63237492cb5908608d
>>> #regzbot title net/mlx5: RX and TX timeouts with ConnectX-4 and
>>> ConnectX-6 VFs on s390
>>> #regzbot ignore-activity
>>>
>>> This isn't a regression? This issue or a fix for it are already
>>> discussed somewhere else? It was fixed already? You want to clarify when
>>> the regression started to happen? Or point out I got the title or
>>> something else totally wrong? Then just reply and tell me -- ideally
>>> while also telling regzbot about it, as explained by the page listed in
>>> the footer of this mail.
>>>
>>> Developers: When fixing the issue, remember to add 'Link:' tags pointing
>>> to the report (the parent of this mail). See page linked in footer for
>>> details.
>>>
>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>> --
>>> Everything you wanna know about Linux kernel regression tracking:
>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>> That page also explains what to do if mails like this annoy you.
>>
>> Hi Thorsten,
>>
>> Thanks for tracking. I actually already sent a fix patch (and v2) for
>> this. Sadly I forgot to link to this mail. Let's see if I can get the
>> regzbot command right to update it. As for the humans the latest fix
>> patch is here:
>>
>> https://lore.kernel.org/netdev/20230531084856.2091666-1-schnelle@linux.ibm.com/
>>

I picked up this patch to net-mlx5 tree, will post to net shortly.

Thanks,
Saeed.

>> Thanks,
>> Niklas
>>
>> #regzbot fix: net/mlx5: Fix setting of irq->map.index for static IRQ case
>
>Looks right, many thx. Sorry, should have looked for that myself. Sadly
>regzbot doesn't yet search for existing post on lore with a matching
>subject, so for completeness let me point manually to it while at it:
>
>#regzbot monitor:
>https://lore.kernel.org/netdev/20230531084856.2091666-1-schnelle@linux.ibm.com/
>
>Ciao, Thorsten
>
diff mbox series

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index db5687d9fec9..94dce3735204 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -237,7 +237,7 @@  struct mlx5_irq *mlx5_irq_alloc(struct
mlx5_irq_pool *pool, int i,
                 * vectors have also been allocated.
                 */
                irq->map.virq = pci_irq_vector(dev->pdev, i);
-               irq->map.index = 0;
+               irq->map.index = i;
        } else {
                irq->map = pci_msix_alloc_irq_at(dev->pdev,