
[mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask

Message ID 20180716083012.15410-1-leon@kernel.org (mailing list archive)
State Superseded
Delegated to: Jason Gunthorpe

Commit Message

Leon Romanovsky July 16, 2018, 8:30 a.m. UTC
From: Leon Romanovsky <leonro@mellanox.com>

The IRQ affinity mask is managed by mlx5_core, but user-triggered
updates through /proc/irq/<irq#>/smp_affinity were not reflected in
mlx5_ib_get_vector_affinity().

Drop the cached copy of the affinity mask and return the value managed
by the PCI core instead.

Fixes: e3ca34880652 ("net/mlx5: Fix build break when CONFIG_SMP=n")
Reported-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/hw/mlx5/main.c | 4 +++-
 include/linux/mlx5/driver.h       | 7 -------
 2 files changed, 3 insertions(+), 8 deletions(-)

--
2.14.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Sagi Grimberg July 16, 2018, 10:23 a.m. UTC | #1
Leon, I'd like to see a tested-by tag for this (at least
until I get some time to test it).

The patch itself looks fine to me.
Leon Romanovsky July 16, 2018, 10:30 a.m. UTC | #2
On Mon, Jul 16, 2018 at 01:23:24PM +0300, Sagi Grimberg wrote:
> Leon, I'd like to see a tested-by tag for this (at least
> until I get some time to test it).

Of course.

Thanks

>
> The patch itself looks fine to me.
Max Gurtovoy July 16, 2018, 2:54 p.m. UTC | #3
Hi,
I've tested this patch and it seems problematic at the moment.
Maybe this is because of the bug that Steve mentioned on the NVMe
mailing list. Sagi mentioned that we should fix it in the NVMe/RDMA
initiator, and I'll run his suggestion as well.
BTW, when I run blk_mq_map_queues it works for every irq affinity.

On 7/16/2018 1:30 PM, Leon Romanovsky wrote:
> On Mon, Jul 16, 2018 at 01:23:24PM +0300, Sagi Grimberg wrote:
>> Leon, I'd like to see a tested-by tag for this (at least
>> until I get some time to test it).
> 
> Of course.
> 
> Thanks
> 
>>
>> The patch itself looks fine to me.


-Max.
Sagi Grimberg July 16, 2018, 2:59 p.m. UTC | #4
> Hi,
> I've tested this patch and seems problematic at this moment.

Problematic how? What are you seeing?

> maybe this is because of the bug that Steve mentioned in the NVMe 
> mailing list. Sagi mentioned that we should fix it in the NVMe/RDMA 
> initiator and I'll run his suggestion as well.

Is your device irq affinity linear?

> BTW, when I run the blk_mq_map_queues it works for every irq affinity.

But it's probably not aligned to the device vector affinity.
Max Gurtovoy July 16, 2018, 4:46 p.m. UTC | #5
On 7/16/2018 5:59 PM, Sagi Grimberg wrote:
> 
>> Hi,
>> I've tested this patch and seems problematic at this moment.
> 
> Problematic how? what are you seeing?

Connection failures and the same error Steve saw:

[Mon Jul 16 16:19:11 2018] nvme nvme0: Connect command failed, error wo/DNR bit: -16402
[Mon Jul 16 16:19:11 2018] nvme nvme0: failed to connect queue: 2 ret=-18
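
The two numbers in this log look like the same error printed twice. The sketch below is only arithmetic, under two assumptions that the thread does not state: that "wo/DNR bit" means the status with the NVMe Do Not Retry flag (0x4000) cleared, and that -18 is -EXDEV as on Linux. The helper name clear_dnr is hypothetical.

```c
/* NVMe "Do Not Retry" status flag; assumed to be the bit the log refers to. */
#define DNR_BIT 0x4000

/*
 * Hypothetical helper: clear the DNR bit from a status value.
 * For a small negative errno all high bits are set (two's complement),
 * so clearing bit 14 subtracts 16384 from the value.
 */
static int clear_dnr(int status)
{
	return status & ~DNR_BIT;
}
```

With that reading, clear_dnr(-18) is -18 - 16384 = -16402, which matches the first log line, so both lines would be reporting ret=-18 for queue 2.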


> 
>> maybe this is because of the bug that Steve mentioned in the NVMe 
>> mailing list. Sagi mentioned that we should fix it in the NVMe/RDMA 
>> initiator and I'll run his suggestion as well.
> 
> Is your device irq affinity linear?

When it's linear and the balancer is stopped the patch works.

> 
>> BTW, when I run the blk_mq_map_queues it works for every irq affinity.
> 
> But its probably not aligned to the device vector affinity.

but I guess it's better in some cases.

I've checked the situation before Leon's patch and set all the vectors
to CPU 0. In this case (I think that this was the initial report by
Steve), we use the affinity_hint (Israel's and Saeed's patches, where we
use dev->priv.irq_info[vector].mask) and it worked fine.

Steve,
Can you share your configuration (kernel, HCA, affinity map, connect
command, lscpu)?
I want to repro it in my lab.

-Max.
Steve Wise July 16, 2018, 5:08 p.m. UTC | #6
Hey Max:


On 7/16/2018 11:46 AM, Max Gurtovoy wrote:
>
>
> On 7/16/2018 5:59 PM, Sagi Grimberg wrote:
>>
>>> Hi,
>>> I've tested this patch and seems problematic at this moment.
>>
>> Problematic how? what are you seeing?
>
> Connection failures and same error Steve saw:
>
> [Mon Jul 16 16:19:11 2018] nvme nvme0: Connect command failed, error
> wo/DNR bit: -16402
> [Mon Jul 16 16:19:11 2018] nvme nvme0: failed to connect queue: 2 ret=-18
>
>
>>
>>> maybe this is because of the bug that Steve mentioned in the NVMe
>>> mailing list. Sagi mentioned that we should fix it in the NVMe/RDMA
>>> initiator and I'll run his suggestion as well.
>>
>> Is your device irq affinity linear?
>
> When it's linear and the balancer is stopped the patch works.
>
>>
>>> BTW, when I run the blk_mq_map_queues it works for every irq affinity.
>>
>> But its probably not aligned to the device vector affinity.
>
> but I guess it's better in some cases.
>
> I've checked the situation before Leon's patch and set all the vetcors
> to CPU 0. In this case (I think that this was the initial report by
> Steve), we use the affinity_hint (Israel's and Saeed's patches were we
> use dev->priv.irq_info[vector].mask) and it worked fine.
>
> Steve,
> Can you share your configuration (kernel, HCA, affinity map, connect
> command, lscpu) ?
> I want to repro it in my lab.
>

- linux-4.18-rc1 + the nvme/nvmet inline_data_size patches + patches to
enable ib_get_vector_affinity() in cxgb4 + sagi's patch + leon's mlx5
patch so I can change the affinity via procfs. 

- mlx5 MT27700 RoCE card, cxgb4 T62100-CR iWARP card

- The system has 2 numa nodes with 8 real cpus in each == 16 cpus all
online.  HT disabled.

- I'm testing over HW loopback for simplicity, so the node is both the
nvme target and host.  Connecting one device like this:
nvme connect -t rdma -a 172.16.2.1 -n nvme-nullb0

- to reproduce the nvme-rdma bug, just map any two hca cq comp vectors
to the same cpu. 

- lscpu output:

[root@stevo1 linux]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 45
Model name:            Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz
Stepping:              7
CPU MHz:               3400.057
CPU max MHz:           3800.0000
CPU min MHz:           1200.0000
BogoMIPS:              6200.10
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good
nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2
x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti
tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts

Steve
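
The reproduction step above (two completion vectors on the same CPU) can be modeled in user space. This is a simplified sketch, assuming the queue map is built by pointing every CPU in a vector's affinity mask at that vector's queue, with later queues overwriting earlier ones. Masks are plain bitmaps (bit n = CPU n), -1 means "no queue", and all names are hypothetical; none of this is the kernel's cpumask or blk-mq API.

```c
#define NR_CPUS 16

/*
 * For each queue, point every CPU in its affinity mask at that queue.
 * A later queue overwrites an earlier one on a shared CPU.
 */
static void map_by_affinity(const unsigned int *vec_mask, int nr_queues,
			    int *cpu_to_queue)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		cpu_to_queue[cpu] = -1;

	for (int q = 0; q < nr_queues; q++)
		for (int cpu = 0; cpu < NR_CPUS; cpu++)
			if (vec_mask[q] & (1u << cpu))
				cpu_to_queue[cpu] = q;
}

/* Count queues that ended up with no CPU pointing at them. */
static int count_unmapped(const int *cpu_to_queue, int nr_queues)
{
	int unmapped = 0;

	for (int q = 0; q < nr_queues; q++) {
		int mapped = 0;

		for (int cpu = 0; cpu < NR_CPUS; cpu++)
			if (cpu_to_queue[cpu] == q)
				mapped = 1;
		if (!mapped)
			unmapped++;
	}
	return unmapped;
}
```

In this model a linear layout (vector i on CPU i) leaves no queue unmapped, while putting vectors 1 and 2 both on CPU 1 leaves queue 1 without any CPU. A queue in that state is the kind of thing that could explain a "failed to connect queue" error, although the thread does not confirm the exact failure path.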


Max Gurtovoy July 17, 2018, 8:46 a.m. UTC | #7
On 7/16/2018 8:08 PM, Steve Wise wrote:
> Hey Max:
> 
> 

Hey,

> On 7/16/2018 11:46 AM, Max Gurtovoy wrote:
>>
>>
>> On 7/16/2018 5:59 PM, Sagi Grimberg wrote:
>>>
>>>> Hi,
>>>> I've tested this patch and seems problematic at this moment.
>>>
>>> Problematic how? what are you seeing?
>>
>> Connection failures and same error Steve saw:
>>
>> [Mon Jul 16 16:19:11 2018] nvme nvme0: Connect command failed, error
>> wo/DNR bit: -16402
>> [Mon Jul 16 16:19:11 2018] nvme nvme0: failed to connect queue: 2 ret=-18
>>
>>
>>>
>>>> maybe this is because of the bug that Steve mentioned in the NVMe
>>>> mailing list. Sagi mentioned that we should fix it in the NVMe/RDMA
>>>> initiator and I'll run his suggestion as well.
>>>
>>> Is your device irq affinity linear?
>>
>> When it's linear and the balancer is stopped the patch works.
>>
>>>
>>>> BTW, when I run the blk_mq_map_queues it works for every irq affinity.
>>>
>>> But its probably not aligned to the device vector affinity.
>>
>> but I guess it's better in some cases.
>>
>> I've checked the situation before Leon's patch and set all the vetcors
>> to CPU 0. In this case (I think that this was the initial report by
>> Steve), we use the affinity_hint (Israel's and Saeed's patches were we
>> use dev->priv.irq_info[vector].mask) and it worked fine.
>>
>> Steve,
>> Can you share your configuration (kernel, HCA, affinity map, connect
>> command, lscpu) ?
>> I want to repro it in my lab.
>>
> 
> - linux-4.18-rc1 + the nvme/nvmet inline_data_size patches + patches to
> enable ib_get_vector_affinity() in cxgb4 + sagi's patch + leon's mlx5
> patch so I can change the affinity via procfs.

Ohh, now I understand that you were complaining about affinity changes
not being reflected in mlx5_ib_get_vector_affinity, and not about the
connection failures when the affinity overlaps (that works fine before
Leon's patch).
So this is a known issue, since we used a static hint from
dev->priv.irq_info[vector].mask that never changes.

IMO we must fulfil the user's wish to connect N queues and not reduce
the count because of affinity overlaps. So in order to push Leon's patch
we must also fix blk_mq_rdma_map_queues to do a best-effort mapping
according to the affinity and map the rest in a naive way (that way we
will *always* map all the queues).

-Max.
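
One possible reading of the "best effort by affinity, then naive for the rest" idea, as a user-space sketch. It is not the fix that was eventually written, and the names (map_best_effort, cpus_of, NO_QUEUE) are hypothetical. Masks are plain bitmaps with bit n = CPU n.

```c
#define NR_CPUS 16
#define NO_QUEUE (-1)

/* Number of CPUs currently pointing at queue q. */
static int cpus_of(const int *cpu_to_queue, int q)
{
	int n = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_to_queue[cpu] == q)
			n++;
	return n;
}

/*
 * Pass 1 honours affinity: a queue claims the still-unclaimed CPUs in its
 * mask. Pass 2 is the naive part: a queue left without a CPU takes the first
 * free CPU, or failing that, one CPU from a queue that owns at least two.
 * Returns the number of queues still unmapped (0 when nr_queues <= NR_CPUS).
 */
static int map_best_effort(const unsigned int *vec_mask, int nr_queues,
			   int *cpu_to_queue)
{
	int unmapped = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		cpu_to_queue[cpu] = NO_QUEUE;

	for (int q = 0; q < nr_queues; q++)
		for (int cpu = 0; cpu < NR_CPUS; cpu++)
			if ((vec_mask[q] & (1u << cpu)) &&
			    cpu_to_queue[cpu] == NO_QUEUE)
				cpu_to_queue[cpu] = q;

	for (int q = 0; q < nr_queues; q++) {
		int victim = -1;

		if (cpus_of(cpu_to_queue, q) > 0)
			continue;
		/* Prefer a CPU nobody has claimed yet. */
		for (int cpu = 0; cpu < NR_CPUS && victim < 0; cpu++)
			if (cpu_to_queue[cpu] == NO_QUEUE)
				victim = cpu;
		/* No free CPU left: take one from a queue that has spares. */
		for (int cpu = 0; cpu < NR_CPUS && victim < 0; cpu++)
			if (cpus_of(cpu_to_queue, cpu_to_queue[cpu]) > 1)
				victim = cpu;
		if (victim < 0)
			unmapped++;
		else
			cpu_to_queue[victim] = q;
	}
	return unmapped;
}
```

With all 16 vectors pinned to CPU 0, this sketch gives queue 0 its affinity CPU and spreads queues 1 to 15 over the remaining CPUs, so every queue is mapped. The trade-off is that the fallback queues no longer run on the CPU their interrupt targets.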

Leon Romanovsky July 17, 2018, 8:58 a.m. UTC | #8
On Tue, Jul 17, 2018 at 11:46:40AM +0300, Max Gurtovoy wrote:
>
>
> On 7/16/2018 8:08 PM, Steve Wise wrote:
> > Hey Max:
> >
> >
>
> Hey,
>
> > On 7/16/2018 11:46 AM, Max Gurtovoy wrote:
> > >
> > >
> > > On 7/16/2018 5:59 PM, Sagi Grimberg wrote:
> > > >
> > > > > Hi,
> > > > > I've tested this patch and seems problematic at this moment.
> > > >
> > > > Problematic how? what are you seeing?
> > >
> > > Connection failures and same error Steve saw:
> > >
> > > [Mon Jul 16 16:19:11 2018] nvme nvme0: Connect command failed, error
> > > wo/DNR bit: -16402
> > > [Mon Jul 16 16:19:11 2018] nvme nvme0: failed to connect queue: 2 ret=-18
> > >
> > >
> > > >
> > > > > maybe this is because of the bug that Steve mentioned in the NVMe
> > > > > mailing list. Sagi mentioned that we should fix it in the NVMe/RDMA
> > > > > initiator and I'll run his suggestion as well.
> > > >
> > > > Is your device irq affinity linear?
> > >
> > > When it's linear and the balancer is stopped the patch works.
> > >
> > > >
> > > > > BTW, when I run the blk_mq_map_queues it works for every irq affinity.
> > > >
> > > > But its probably not aligned to the device vector affinity.
> > >
> > > but I guess it's better in some cases.
> > >
> > > I've checked the situation before Leon's patch and set all the vetcors
> > > to CPU 0. In this case (I think that this was the initial report by
> > > Steve), we use the affinity_hint (Israel's and Saeed's patches were we
> > > use dev->priv.irq_info[vector].mask) and it worked fine.
> > >
> > > Steve,
> > > Can you share your configuration (kernel, HCA, affinity map, connect
> > > command, lscpu) ?
> > > I want to repro it in my lab.
> > >
> >
> > - linux-4.18-rc1 + the nvme/nvmet inline_data_size patches + patches to
> > enable ib_get_vector_affinity() in cxgb4 + sagi's patch + leon's mlx5
> > patch so I can change the affinity via procfs.
>
> ohh, now I understand that you where complaining regarding the affinity
> change reflection to mlx5_ib_get_vector_affinity and not regarding the
> failures on connecting while the affinity overlaps (that is working good
> before Leon's patch).
> So this is a known issue since we used a static hint that never changes
> from dev->priv.irq_info[vector].mask.
>
> IMO we must fulfil the user wish to connect to N queues and not reduce it
> because of affinity overlaps. So in order to push Leon's patch we must
> also fix the blk_mq_rdma_map_queues to do a best effort mapping according
> the affinity and map the rest in naive way (in that way we will *always*
> map all the queues).

Max,

I have no clue what needs to be done in blk_mq*, but my patch only gave
users the ability to reconfigure their affinity mask after the driver is
loaded.

Thanks

>
> -Max.
>
Max Gurtovoy July 17, 2018, 10:05 a.m. UTC | #9
On 7/17/2018 11:58 AM, Leon Romanovsky wrote:
> On Tue, Jul 17, 2018 at 11:46:40AM +0300, Max Gurtovoy wrote:
>>
>>
>> On 7/16/2018 8:08 PM, Steve Wise wrote:
>>> Hey Max:
>>>
>>>
>>
>> Hey,
>>
>>> On 7/16/2018 11:46 AM, Max Gurtovoy wrote:
>>>>
>>>>
>>>> On 7/16/2018 5:59 PM, Sagi Grimberg wrote:
>>>>>
>>>>>> Hi,
>>>>>> I've tested this patch and seems problematic at this moment.
>>>>>
>>>>> Problematic how? what are you seeing?
>>>>
>>>> Connection failures and same error Steve saw:
>>>>
>>>> [Mon Jul 16 16:19:11 2018] nvme nvme0: Connect command failed, error
>>>> wo/DNR bit: -16402
>>>> [Mon Jul 16 16:19:11 2018] nvme nvme0: failed to connect queue: 2 ret=-18
>>>>
>>>>
>>>>>
>>>>>> maybe this is because of the bug that Steve mentioned in the NVMe
>>>>>> mailing list. Sagi mentioned that we should fix it in the NVMe/RDMA
>>>>>> initiator and I'll run his suggestion as well.
>>>>>
>>>>> Is your device irq affinity linear?
>>>>
>>>> When it's linear and the balancer is stopped the patch works.
>>>>
>>>>>
>>>>>> BTW, when I run the blk_mq_map_queues it works for every irq affinity.
>>>>>
>>>>> But its probably not aligned to the device vector affinity.
>>>>
>>>> but I guess it's better in some cases.
>>>>
>>>> I've checked the situation before Leon's patch and set all the vetcors
>>>> to CPU 0. In this case (I think that this was the initial report by
>>>> Steve), we use the affinity_hint (Israel's and Saeed's patches were we
>>>> use dev->priv.irq_info[vector].mask) and it worked fine.
>>>>
>>>> Steve,
>>>> Can you share your configuration (kernel, HCA, affinity map, connect
>>>> command, lscpu) ?
>>>> I want to repro it in my lab.
>>>>
>>>
>>> - linux-4.18-rc1 + the nvme/nvmet inline_data_size patches + patches to
>>> enable ib_get_vector_affinity() in cxgb4 + sagi's patch + leon's mlx5
>>> patch so I can change the affinity via procfs.
>>
>> ohh, now I understand that you where complaining regarding the affinity
>> change reflection to mlx5_ib_get_vector_affinity and not regarding the
>> failures on connecting while the affinity overlaps (that is working good
>> before Leon's patch).
>> So this is a known issue since we used a static hint that never changes
>> from dev->priv.irq_info[vector].mask.
>>
>> IMO we must fulfil the user wish to connect to N queues and not reduce it
>> because of affinity overlaps. So in order to push Leon's patch we must
>> also fix the blk_mq_rdma_map_queues to do a best effort mapping according
>> the affinity and map the rest in naive way (in that way we will *always*
>> map all the queues).
> 
> Max,
> 
> I have no clue what is needed to do int blq_mq*, but my patch only gave
> to users ability reconfigure their affinity mask after driver is loaded.

Yes, I know, but since the only users of this API are blk-mq-rdma and
the nvme_rdma driver, which can't establish a connection with your
patch, I suggest holding off on pushing it and fixing the mapping in
blk_mq_rdma_map_queues first.

> 
> Thanks
> 
>>
>> -Max.
>>
Steve Wise July 17, 2018, 1:03 p.m. UTC | #10
> On 7/16/2018 8:08 PM, Steve Wise wrote:
> > Hey Max:
> >
> >
> 
> Hey,
> 
> > On 7/16/2018 11:46 AM, Max Gurtovoy wrote:
> >>
> >>
> >> On 7/16/2018 5:59 PM, Sagi Grimberg wrote:
> >>>
> >>>> Hi,
> >>>> I've tested this patch and seems problematic at this moment.
> >>>
> >>> Problematic how? what are you seeing?
> >>
> >> Connection failures and same error Steve saw:
> >>
> >> [Mon Jul 16 16:19:11 2018] nvme nvme0: Connect command failed, error
> >> wo/DNR bit: -16402
> >> [Mon Jul 16 16:19:11 2018] nvme nvme0: failed to connect queue: 2 ret=-18
> >>
> >>
> >>>
> >>>> maybe this is because of the bug that Steve mentioned in the NVMe
> >>>> mailing list. Sagi mentioned that we should fix it in the NVMe/RDMA
> >>>> initiator and I'll run his suggestion as well.
> >>>
> >>> Is your device irq affinity linear?
> >>
> >> When it's linear and the balancer is stopped the patch works.
> >>
> >>>
> >>>> BTW, when I run the blk_mq_map_queues it works for every irq affinity.
> >>>
> >>> But its probably not aligned to the device vector affinity.
> >>
> >> but I guess it's better in some cases.
> >>
> >> I've checked the situation before Leon's patch and set all the vetcors
> >> to CPU 0. In this case (I think that this was the initial report by
> >> Steve), we use the affinity_hint (Israel's and Saeed's patches were we
> >> use dev->priv.irq_info[vector].mask) and it worked fine.
> >>
> >> Steve,
> >> Can you share your configuration (kernel, HCA, affinity map, connect
> >> command, lscpu) ?
> >> I want to repro it in my lab.
> >>
> >
> > - linux-4.18-rc1 + the nvme/nvmet inline_data_size patches + patches to
> > enable ib_get_vector_affinity() in cxgb4 + sagi's patch + leon's mlx5
> > patch so I can change the affinity via procfs.
> 
> ohh, now I understand that you where complaining regarding the affinity
> change reflection to mlx5_ib_get_vector_affinity and not regarding the
> failures on connecting while the affinity overlaps (that is working good
> before Leon's patch).
> So this is a known issue since we used a static hint that never changes
> from dev->priv.irq_info[vector].mask.
> 
> IMO we must fulfil the user wish to connect to N queues and not reduce
> it because of affinity overlaps. So in order to push Leon's patch we
> must also fix the blk_mq_rdma_map_queues to do a best effort mapping
> according the affinity and map the rest in naive way (in that way we
> will *always* map all the queues).

That is what I would expect also.  For example, on my system, which has
16 cpus and 2 numa nodes, I observe much better nvmf IOPS performance by
setting up my 16 driver completion event queues such that each is bound to a
node-local cpu.  So I end up with each node-local cpu having 2 queues bound
to it.  W/O adding support in iw_cxgb4 for ib_get_vector_affinity(), this
works fine.  I assumed adding ib_get_vector_affinity() would allow this to
all "just work" by default, but I'm running into this connection failure
issue.

I don't understand exactly what the blk_mq layer is trying to do, but I
assume it has ingress event queues and processing that it is trying to align
with the driver's ingress cq event handling, so everybody stays on the same
cpu (or at least node).  But something else is going on.  Is there
documentation on how this works somewhere?

Thanks,

Steve
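
The layout described above (16 completion vectors spread over the 8 CPUs of one NUMA node) can be written down as a small sketch. Which node is local to the adapter is not stated in the thread, so first_cpu is a parameter. The helper names are hypothetical. The only external fact relied on is that /proc/irq/<irq#>/smp_affinity takes a hexadecimal CPU bitmask.

```c
#include <stdio.h>

/*
 * Hypothetical layout helper: spread completion vectors round-robin over the
 * ncpus CPUs of one NUMA node, starting at first_cpu.
 */
static int vector_to_cpu(int vector, int first_cpu, int ncpus)
{
	return first_cpu + vector % ncpus;
}

/*
 * Format the single-CPU mask for `cpu` as a hex string of the kind written
 * to /proc/irq/<irq#>/smp_affinity (valid in this sketch for cpu < 32).
 */
static void cpu_to_smp_affinity(int cpu, char *buf, size_t len)
{
	snprintf(buf, len, "%x", 1u << cpu);
}
```

With first_cpu = 8 and ncpus = 8, vectors 0 and 8 both land on CPU 8, so every node-local CPU carries two vectors. That is the overlapping case the rest of the thread is about.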


Patch

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index d0834525afe3..1c3584024acb 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -5304,8 +5304,10 @@  static const struct cpumask *
 mlx5_ib_get_vector_affinity(struct ib_device *ibdev, int comp_vector)
 {
 	struct mlx5_ib_dev *dev = to_mdev(ibdev);
+	int irq = pci_irq_vector(dev->mdev->pdev,
+				 MLX5_EQ_VEC_COMP_BASE + comp_vector);

-	return mlx5_get_vector_affinity_hint(dev->mdev, comp_vector);
+	return irq_get_affinity_mask(irq);
 }

 /* The mlx5_ib_multiport_mutex should be held when calling this function */
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 0b7daa4a8f84..d3581cd5d517 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -1287,11 +1287,4 @@  static inline int mlx5_core_native_port_num(struct mlx5_core_dev *dev)
 enum {
 	MLX5_TRIGGERED_CMD_COMP = (u64)1 << 32,
 };
-
-static inline const struct cpumask *
-mlx5_get_vector_affinity_hint(struct mlx5_core_dev *dev, int vector)
-{
-	return dev->priv.irq_info[vector].mask;
-}
-
 #endif /* MLX5_DRIVER_H */
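
The behavioural difference introduced by the patch can be illustrated with a user-space model. This is a sketch only: the struct and function names are hypothetical, masks are plain bitmaps, and nothing here is kernel code. It assumes, as the commit message says, that the cached hint is not updated when the user changes the affinity through procfs.

```c
/* User-space model of one completion vector's IRQ; not kernel code. */
struct vec_irq {
	unsigned int live_mask;   /* what the IRQ core currently uses */
	unsigned int cached_hint; /* copy taken once, at driver load */
};

static void vec_init(struct vec_irq *v, unsigned int mask)
{
	v->live_mask = mask;
	v->cached_hint = mask;
}

/* Models a write to /proc/irq/<irq#>/smp_affinity: only the live mask moves. */
static void user_set_affinity(struct vec_irq *v, unsigned int mask)
{
	v->live_mask = mask;
}

/* Before the patch: report the cached copy. */
static unsigned int get_affinity_cached(const struct vec_irq *v)
{
	return v->cached_hint;
}

/* After the patch: report the live mask. */
static unsigned int get_affinity_live(const struct vec_irq *v)
{
	return v->live_mask;
}
```

After vec_init(&v, 0x1) followed by user_set_affinity(&v, 0x4), the cached getter still reports 0x1 while the live getter reports 0x4. That stale value is what Steve's report was about. The live value is what then exposes overlapping masks to the queue-mapping code.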