
[v2] block: fix rdma queue mapping

Message ID 20180820205420.25908-1-sagi@grimberg.me (mailing list archive)
State Not Applicable
Series [v2] block: fix rdma queue mapping

Commit Message

Sagi Grimberg Aug. 20, 2018, 8:54 p.m. UTC
nvme-rdma attempts to map queues based on irq vector affinity.
However, for some devices, completion vector irq affinity is
configurable by the user which can break the existing assumption
that irq vectors are optimally arranged over the host cpu cores.

So we map queues in two stages:
First, map queues according to the completion vector IRQ affinity,
taking the first cpu in each vector's affinity mask. If the current
irq affinity is arranged such that a vector is not assigned to any
distinct cpu, we map it to a cpu on the same numa node. If numa
affinity cannot be satisfied, we map it to any unmapped cpu we can
find. Then, map the remaining cpus in the possible cpumap naively.

Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
Changes from v1:
- fixed double semicolon typo

 block/blk-mq-cpumap.c  | 39 +++++++++++---------
 block/blk-mq-rdma.c    | 80 ++++++++++++++++++++++++++++++++++++------
 include/linux/blk-mq.h |  1 +
 3 files changed, 93 insertions(+), 27 deletions(-)

Comments

Ming Lei Aug. 21, 2018, 2:04 a.m. UTC | #1
On Mon, Aug 20, 2018 at 01:54:20PM -0700, Sagi Grimberg wrote:
> nvme-rdma attempts to map queues based on irq vector affinity.
> However, for some devices, completion vector irq affinity is
> configurable by the user which can break the existing assumption
> that irq vectors are optimally arranged over the host cpu cores.
> 
> So we map queues in two stages:
> First map queues according to corresponding to the completion
> vector IRQ affinity taking the first cpu in the vector affinity map.
> if the current irq affinity is arranged such that a vector is not
> assigned to any distinct cpu, we map it to a cpu that is on the same
> node. If numa affinity can not be sufficed, we map it to any unmapped
> cpu we can find. Then, map the remaining cpus in the possible cpumap
> naively.

I guess this way still can't fix the request allocation crash issue
triggered by using blk_mq_alloc_request_hctx(), in which one hw queue may
not be mapped from any online CPU.

Maybe this patch isn't for this issue, but it is closely related.

Thanks,
Ming
Christoph Hellwig Aug. 22, 2018, 1:11 p.m. UTC | #2
On Mon, Aug 20, 2018 at 01:54:20PM -0700, Sagi Grimberg wrote:
> nvme-rdma attempts to map queues based on irq vector affinity.
> However, for some devices, completion vector irq affinity is
> configurable by the user which can break the existing assumption
> that irq vectors are optimally arranged over the host cpu cores.

IFF affinity is configurable we should never use this code,
as it breaks the model entirely.  ib_get_vector_affinity should
never return a valid mask if affinity is configurable.
Sagi Grimberg Aug. 25, 2018, 2:06 a.m. UTC | #3
> I guess this way still can't fix the request allocation crash issue
> triggered by using blk_mq_alloc_request_hctx(), in which one hw queue may
> not be mapped from any online CPU.

Not really. I guess we will need to simply skip queues that are
mapped to an offline cpu.

> Maybe this patch isn't for this issue, but it is closely related.

Yes, another patch is still needed.

Steve, do you still have that patch? I don't seem to
find it anywhere.
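
A rough sketch of that kind of guard (not a patch from this thread; the
helper name, the call site and the reliance on the queue_hw_ctx internals
are assumptions): skip connecting I/O queues whose hctx has no online CPU
mapped, so blk_mq_alloc_request_hctx() is never asked to allocate on an
empty hw queue.

#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/cpumask.h>

/* Return true if at least one online CPU is mapped to hw queue @hctx_idx. */
static bool queue_has_online_cpu(struct request_queue *q, unsigned int hctx_idx)
{
        struct blk_mq_hw_ctx *hctx = q->queue_hw_ctx[hctx_idx];

        return cpumask_first_and(hctx->cpumask, cpu_online_mask) < nr_cpu_ids;
}

A fabrics connect loop could then skip nvmf_connect_io_queue() for any
queue where this returns false, and pick that queue up again on a later
CPU hotplug event.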
Sagi Grimberg Aug. 25, 2018, 2:17 a.m. UTC | #4
>> nvme-rdma attempts to map queues based on irq vector affinity.
>> However, for some devices, completion vector irq affinity is
>> configurable by the user which can break the existing assumption
>> that irq vectors are optimally arranged over the host cpu cores.
> 
> IFF affinity is configurable we should never use this code,
> as it breaks the model entirely.  ib_get_vector_affinity should
> never return a valid mask if affinity is configurable.

I agree that the model as initially intended doesn't fit. But it seems
that some users like to write into their nic's
/proc/irq/$IRP/smp_affinity and get mad at us for not letting them do
so when we use managed affinity.

So instead of falling back to the block mapping function we try
to do a little better first:
1. map according to the device vector affinity
2. map vectors that end up without a mapping to cpus that belong
    to the same numa-node
3. map all the rest of the unmapped cpus like the block layer
    would do.

We could have device drivers that don't use managed affinity never
return a valid mask, but that would never allow affinity-based mapping,
which is optimal at least for users that do not meddle with device
irq affinity (which is probably the majority of users).

Thoughts?
Steve Wise Aug. 25, 2018, 12:18 p.m. UTC | #5
> > I guess this way still can't fix the request allocation crash issue
> > triggered by using blk_mq_alloc_request_hctx(), in which one hw queue
> may
> > not be mapped from any online CPU.
> 
> Not really. I guess we will need to simply skip queues that are
> mapped to an offline cpu.
> 
> > Maybe this patch isn't for this issue, but it is closely related.
> 
> Yes, another patch is still needed.
> 
> Steve, do you still have that patch? I don't seem to
> find it anywhere.

I have no such patch.  I don't remember this issue.

Steve.
Ming Lei Aug. 27, 2018, 3:50 a.m. UTC | #6
On Sat, Aug 25, 2018 at 07:18:43AM -0500, Steve Wise wrote:
> > > I guess this way still can't fix the request allocation crash issue
> > > triggered by using blk_mq_alloc_request_hctx(), in which one hw queue
> > may
> > > not be mapped from any online CPU.
> > 
> > Not really. I guess we will need to simply skip queues that are
> > mapped to an offline cpu.
> > 
> > > Maybe this patch isn't for this issue, but it is closely related.
> > 
> > Yes, another patch is still needed.
> > 
> > Steve, do you still have that patch? I don't seem to
> > find it anywhere.
> 
> I have no such patch.  I don't remember this issue.

This issue can be reproduced by doing CPU hotplug while running IO;
the following log is then triggered:

[  396.629000] smpboot: CPU 2 is now offline
[  396.640759] nvmet: creating controller 1 for subsystem testnqn for NQN nqn.2014-08.org.nvmexpress:uuid:aa7d7d4a-0ee1-4960-ad0d-dbfe768b88d4.
[  396.642036] nvme nvme1: creating 1 I/O queues.
[  396.642480] BUG: unable to handle kernel paging request at 000060fd0a6bff48
[  396.643128] PGD 0 P4D 0
[  396.643364] Oops: 0002 [#1] PREEMPT SMP PTI
[  396.643774] CPU: 3 PID: 7 Comm: kworker/u8:0 Not tainted 4.18.0_2923b27e5424_master+ #1
[  396.644588] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
[  396.645597] Workqueue: nvme-reset-wq nvme_loop_reset_ctrl_work [nvme_loop]
[  396.646371] RIP: 0010:blk_mq_get_request+0x2e2/0x375
[  396.646924] Code: 00 00 48 c7 83 28 01 00 00 00 00 00 00 48 c7 83 30 01 00 00 00 00 00 00 48 8b 55 10 74 0c 31 c0 41 f7 c4 00 08 06 00 0f 95 c0 <48> ff 44 c2 48 41 81 e4 00 00 06 00 c7 83 d8 00 00 00 01 00 00 00
[  396.648699] RSP: 0018:ffffc90000c87cc8 EFLAGS: 00010246
[  396.649212] RAX: 0000000000000000 RBX: ffff880276350000 RCX: 0000000000000017
[  396.649931] RDX: 000060fd0a6bff00 RSI: 000000000010e9e6 RDI: 00155563b3000000
[  396.650646] RBP: ffffc90000c87d08 R08: 00000000f461b8ce R09: 000000000000006c
[  396.651393] R10: ffffc90000c87e50 R11: 0000000000000000 R12: 0000000000000023
[  396.652100] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  396.652831] FS:  0000000000000000(0000) GS:ffff880277b80000(0000) knlGS:0000000000000000
[  396.653652] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  396.654254] CR2: 000060fd0a6bff48 CR3: 000000000200a006 CR4: 0000000000760ee0
[  396.655051] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  396.655841] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  396.656542] PKRU: 55555554
[  396.656808] Call Trace:
[  396.657094]  blk_mq_alloc_request_hctx+0xd1/0x11c
[  396.657599]  ? pcpu_block_update_hint_alloc+0xa1/0x1a5
[  396.658101]  nvme_alloc_request+0x42/0x71
[  396.658517]  __nvme_submit_sync_cmd+0x2d/0xd1
[  396.658945]  nvmf_connect_io_queue+0x10b/0x162 [nvme_fabrics]
[  396.659543]  ? nvme_loop_connect_io_queues+0x2d/0x52 [nvme_loop]
[  396.660126]  nvme_loop_connect_io_queues+0x2d/0x52 [nvme_loop]
[  396.660743]  nvme_loop_reset_ctrl_work+0x62/0xcf [nvme_loop]
[  396.661296]  process_one_work+0x1c9/0x2f6
[  396.661694]  ? rescuer_thread+0x282/0x282
[  396.662080]  process_scheduled_works+0x27/0x2c
[  396.662510]  worker_thread+0x1e7/0x295
[  396.662960]  kthread+0x115/0x11d
[  396.663334]  ? kthread_park+0x76/0x76
[  396.663752]  ret_from_fork+0x35/0x40
[  396.664169] Modules linked in: nvme_loop nvmet nvme_fabrics null_blk scsi_debug isofs iTCO_wdt iTCO_vendor_support i2c_i801 lpc_ich i2c_core mfd_core ip_tables sr_mod cdrom usb_storage sd_mod ahci libahci libata crc32c_intel virtio_scsi qemu_fw_cfg dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]
[  396.667015] Dumping ftrace buffer:
[  396.667348]    (ftrace buffer empty)
[  396.667758] CR2: 000060fd0a6bff48
[  396.668129] ---[ end trace 887712785c99c2ca ]---


Thanks,
Ming
Steve Wise Oct. 3, 2018, 7:05 p.m. UTC | #7
On 8/24/2018 9:17 PM, Sagi Grimberg wrote:
> 
>>> nvme-rdma attempts to map queues based on irq vector affinity.
>>> However, for some devices, completion vector irq affinity is
>>> configurable by the user which can break the existing assumption
>>> that irq vectors are optimally arranged over the host cpu cores.
>>
>> IFF affinity is configurable we should never use this code,
>> as it breaks the model entirely.  ib_get_vector_affinity should
>> never return a valid mask if affinity is configurable.
> 
> I agree that the model intended initially doesn't fit. But it seems
> that some users like to write into their nic's
> /proc/irq/$IRP/smp_affinity and get mad at us for not letting them with
> using managed affinity.
> 
> So instead of falling back to the block mapping function we try
> to do a little better first:
> 1. map according to the device vector affinity
> 2. map vectors that end up without a mapping to cpus that belong
>    to the same numa-node
> 3. map all the rest of the unmapped cpus like the block layer
>    would do.
> 
> We could have device drivers that don't use managed affinity to never
> return a valid mask but that would never allow affinity based mapping
> which is optimal at least for users that do not mangle with device
> irq affinity (which is probably the majority of users).
> 
> Thoughts?

Can we please make forward progress on this?

Christoph, Sagi:  it seems you think /proc/irq/$IRP/smp_affinity
shouldn't be allowed if drivers support managed affinity. Is that correct?

Perhaps that can be codified and be a way forward?  IE: Somehow allow
the admin to choose either "managed by the driver/ulps" or "managed by
the system admin directly"?

Or just use Sagi's patch.  Perhaps a WARN_ONCE() if the affinity looks
wonked when set via procfs?  Just thinking out loud...

But as it stands, things are just plain borked if an rdma driver
supports ib_get_vector_affinity() yet the admin changes the affinity via
/proc...

Thanks,

Steve.
Sagi Grimberg Oct. 3, 2018, 9:14 p.m. UTC | #8
> Can we please make forward progress on this?
> 
> Christoph, Sagi:  it seems you think /proc/irq/$IRP/smp_affinity
> shouldn't be allowed if drivers support managed affinity. Is that correct?
> 
> Perhaps that can be codified and be a way forward?  IE: Somehow allow
> the admin to choose either "managed by the driver/ulps" or "managed by
> the system admin directly"?
> 
> Or just use Sagi's patch.  Perhaps a WARN_ONCE() if the affinity looks
> wonked when set via procfs?  Just thinking out loud...

I think we should take my patch. While ideally we'd like to have a
perfect linear irq-based queue mapping, people apparently really like
to modify their devices' irq affinity. With this, at least some of the
queues are assigned according to the device affinity. It's not ideal,
but at least better than blindly falling back to the default mapping.

Christoph?

If there is strong resistance to the patch we should at least start
by falling back to the default mapping unconditionally.
Steve Wise Oct. 3, 2018, 9:21 p.m. UTC | #9
On 10/3/2018 4:14 PM, Sagi Grimberg wrote:
> 
>> Can we please make forward progress on this?
>>
>> Christoph, Sagi:  it seems you think /proc/irq/$IRP/smp_affinity
>> shouldn't be allowed if drivers support managed affinity. Is that
>> correct?
>>
>> Perhaps that can be codified and be a way forward?  IE: Somehow allow
>> the admin to choose either "managed by the driver/ulps" or "managed by
>> the system admin directly"?
>>
>> Or just use Sagi's patch.  Perhaps a WARN_ONCE() if the affinity looks
>> wonked when set via procfs?  Just thinking out loud...
> 
> I think we should take my patch. While ideally we'd like to have
> perfect linear irq based queue mapping, people apparently really like
> to modify devices irq affinity. With this, at least some of the queues
> are assigned according to the device affinity. Its not ideal, but at
> least better than blindly falling to default mapping.
> 
> Christoph?
> 
> If there is strong resistance to the patch we should at least start
> by falling to the default mapping unconditionally.

Perhaps we should still add a WARN_ONCE if we don't get a linear
mapping?  IE admins are now tweaking this because in the past the
default affinity could hurt performance, and adjusting it via procfs was
the answer.  But now we're making the devices and ULPs smarter, so we
should maybe let the admin know when we think the ideal affinity is not
being achieved.

Steve.
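
A sketch of that warning (not part of the posted patch; the helper name
is made up, and it could be called at the end of blk_mq_rdma_map_queue()
in the patch at the bottom of this page): warn once when a queue did not
end up mapped from any CPU in its own completion vector affinity mask,
i.e. the admin-tweaked affinity forced the numa-node or any-unmapped-cpu
fallback.

#include <linux/blk-mq.h>
#include <linux/bug.h>
#include <linux/cpumask.h>

static void blk_mq_rdma_warn_non_affine(struct blk_mq_tag_set *set,
                                        const struct cpumask *mask,
                                        unsigned int queue)
{
        unsigned int cpu;

        /* Any CPU in the vector's mask mapped to this queue? */
        for_each_cpu(cpu, mask) {
                if (set->mq_map[cpu] == queue)
                        return;         /* affinity-aware mapping achieved */
        }

        WARN_ONCE(1, "blk-mq-rdma: queue %u not mapped from its vector affinity mask\n",
                  queue);
}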
Sagi Grimberg Oct. 16, 2018, 1:04 a.m. UTC | #10
>> Can we please make forward progress on this?
>>
>> Christoph, Sagi:  it seems you think /proc/irq/$IRP/smp_affinity
>> shouldn't be allowed if drivers support managed affinity. Is that 
>> correct?
>>
>> Perhaps that can be codified and be a way forward?  IE: Somehow allow
>> the admin to choose either "managed by the driver/ulps" or "managed by
>> the system admin directly"?
>>
>> Or just use Sagi's patch.  Perhaps a WARN_ONCE() if the affinity looks
>> wonked when set via procfs?  Just thinking out loud...
> 
> I think we should take my patch. While ideally we'd like to have
> perfect linear irq based queue mapping, people apparently really like
> to modify devices irq affinity. With this, at least some of the queues
> are assigned according to the device affinity. Its not ideal, but at
> least better than blindly falling to default mapping.
> 
> Christoph?
> 
> If there is strong resistance to the patch we should at least start
> by falling to the default mapping unconditionally.

Ping?

Would you prefer we fall back to the naive mapping if the device is not
using managed affinity?
Christoph Hellwig Oct. 17, 2018, 4:37 p.m. UTC | #11
On Wed, Oct 03, 2018 at 02:05:16PM -0500, Steve Wise wrote:
> Christoph, Sagi:  it seems you think /proc/irq/$IRP/smp_affinity
> shouldn't be allowed if drivers support managed affinity. Is that correct?

Not just shouldn't, but simply can't.

> But as it stands, things are just plain borked if an rdma driver
> supports ib_get_vector_affinity() yet the admin changes the affinity via
> /proc...

I think we need to fix ib_get_vector_affinity to not return anything
if the device doesn't use managed irq affinity.
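
A rough sketch of that direction (purely illustrative; the "foo" driver,
its flag and its helpers are made up, only the ->get_vector_affinity() op
and pci_irq_get_affinity() are the real APIs being shown):

#include <linux/kernel.h>
#include <linux/pci.h>
#include <rdma/ib_verbs.h>

/* Hypothetical driver state, for illustration only. */
struct foo_dev {
        struct ib_device        ibdev;
        struct pci_dev          *pdev;
        bool                    managed_affinity;
};

#define to_foo_dev(x) container_of(x, struct foo_dev, ibdev)

static const struct cpumask *
foo_get_vector_affinity(struct ib_device *ibdev, int comp_vector)
{
        struct foo_dev *fdev = to_foo_dev(ibdev);

        /*
         * Without managed irq affinity the mask is user-tunable, so report
         * nothing and let callers such as blk_mq_rdma_map_queues() fall
         * back to the default mapping.
         */
        if (!fdev->managed_affinity)
                return NULL;

        return pci_irq_get_affinity(fdev->pdev, comp_vector);
}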
Christoph Hellwig Oct. 17, 2018, 4:37 p.m. UTC | #12
On Mon, Oct 15, 2018 at 06:04:46PM -0700, Sagi Grimberg wrote:
> Would you prefer we fallback to the naive mapping if the device is not
> using manged affinity?

Yes.
Sagi Grimberg Oct. 23, 2018, 6:02 a.m. UTC | #13
>> Christoph, Sagi:  it seems you think /proc/irq/$IRP/smp_affinity
>> shouldn't be allowed if drivers support managed affinity. Is that correct?
> 
> Not just shouldn't, but simply can't.
> 
>> But as it stands, things are just plain borked if an rdma driver
>> supports ib_get_vector_affinity() yet the admin changes the affinity via
>> /proc...
> 
> I think we need to fix ib_get_vector_affinity to not return anything
> if the device doesn't use managed irq affinity.

Steve, does iw_cxgb4 use managed affinity?

I'll send a patch for mlx5 to simply not return anything as managed
affinity is not something that the maintainers want to do.
Steve Wise Oct. 23, 2018, 1 p.m. UTC | #14
> >> Christoph, Sagi:  it seems you think /proc/irq/$IRP/smp_affinity
> >> shouldn't be allowed if drivers support managed affinity. Is that correct?
> >
> > Not just shouldn't, but simply can't.
> >
> >> But as it stands, things are just plain borked if an rdma driver
> >> supports ib_get_vector_affinity() yet the admin changes the affinity via
> >> /proc...
> >
> > I think we need to fix ib_get_vector_affinity to not return anything
> > if the device doesn't use managed irq affinity.
> 
> Steve, does iw_cxgb4 use managed affinity?
> 
> I'll send a patch for mlx5 to simply not return anything as managed
> affinity is not something that the maintainers want to do.

I'm beginning to think I don't know what "managed affinity" actually is.  Currently iw_cxgb4 doesn't support ib_get_vector_affinity().  I have a patch for it, but ran into this whole issue with nvme failing if someone changes the affinity map via /proc.

Steve.
Sagi Grimberg Oct. 23, 2018, 9:25 p.m. UTC | #15
>>>> Christoph, Sagi:  it seems you think /proc/irq/$IRP/smp_affinity
>>>> shouldn't be allowed if drivers support managed affinity. Is that correct?
>>>
>>> Not just shouldn't, but simply can't.
>>>
>>>> But as it stands, things are just plain borked if an rdma driver
>>>> supports ib_get_vector_affinity() yet the admin changes the affinity via
>>>> /proc...
>>>
>>> I think we need to fix ib_get_vector_affinity to not return anything
>>> if the device doesn't use managed irq affinity.
>>
>> Steve, does iw_cxgb4 use managed affinity?
>>
>> I'll send a patch for mlx5 to simply not return anything as managed
>> affinity is not something that the maintainers want to do.
> 
> I'm beginning to think I don't know what "managed affinity" actually is.  Currently iw_cxgb4 doesn't support ib_get_vector_affinity().  I have a patch for it, but ran into this whole issue with nvme failing if someone changes the affinity map via /proc.

That means that the pci subsystem sets your vector(s) affinity correctly
and keeps it immutable. It also guarantees that you have reserved vectors
and do not get a best-effort assignment when cpu cores are offlined.

You can simply enable it by adding PCI_IRQ_AFFINITY to
pci_alloc_irq_vectors(), or by calling pci_alloc_irq_vectors_affinity()
to communicate pre/post vectors that don't participate in
affinitization (nvme uses it for the admin queue).

This way you can easily plug in ->get_vector_affinity() to return
pci_irq_get_affinity(dev, vector).

The original patch set from hch:
https://lwn.net/Articles/693653/
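
A minimal sketch of the allocation side (again using the hypothetical
"foo" driver; PCI_IRQ_AFFINITY, pci_alloc_irq_vectors_affinity() and
pci_irq_get_affinity() are the real APIs being shown):

#include <linux/pci.h>
#include <linux/interrupt.h>

#define FOO_RESERVED_VECS       1       /* e.g. an admin/async event vector */

static int foo_alloc_comp_vectors(struct pci_dev *pdev, int nr_io_vecs)
{
        struct irq_affinity affd = {
                /* vectors excluded from the affinity spreading */
                .pre_vectors = FOO_RESERVED_VECS,
        };

        /*
         * PCI_IRQ_AFFINITY makes the core spread the I/O vectors over the
         * CPUs and marks the resulting affinity managed (immutable from
         * /proc/irq/).  ->get_vector_affinity() can then simply return
         * pci_irq_get_affinity(pdev, FOO_RESERVED_VECS + comp_vector).
         */
        return pci_alloc_irq_vectors_affinity(pdev, FOO_RESERVED_VECS + 1,
                        FOO_RESERVED_VECS + nr_io_vecs,
                        PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
}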
Steve Wise Oct. 23, 2018, 9:31 p.m. UTC | #16
> >>>> Christoph, Sagi:  it seems you think /proc/irq/$IRP/smp_affinity
> >>>> shouldn't be allowed if drivers support managed affinity. Is that
> correct?
> >>>
> >>> Not just shouldn't, but simply can't.
> >>>
> >>>> But as it stands, things are just plain borked if an rdma driver
> >>>> supports ib_get_vector_affinity() yet the admin changes the affinity via
> >>>> /proc...
> >>>
> >>> I think we need to fix ib_get_vector_affinity to not return anything
> >>> if the device doesn't use managed irq affinity.
> >>
> >> Steve, does iw_cxgb4 use managed affinity?
> >>
> >> I'll send a patch for mlx5 to simply not return anything as managed
> >> affinity is not something that the maintainers want to do.
> >
> > I'm beginning to think I don't know what "managed affinity" actually is.
> Currently iw_cxgb4 doesn't support ib_get_vector_affinity().  I have a patch
> for it, but ran into this whole issue with nvme failing if someone changes the
> affinity map via /proc.
> 
> That means that the pci subsystem gets your vector(s) affinity right and
> immutable. It also guarantees that you have reserved vectors and not get
> a best effort assignment when cpu cores are offlined.
> 
> You can simply enable it by adding PCI_IRQ_AFFINITY to
> pci_alloc_irq_vectors() or call pci_alloc_irq_vectors_affinity()
> to communicate post/pre vectors that don't participate in
> affinitization (nvme uses it for admin queue).
> 
> This way you can easily plug ->get_vector_affinity() to return
> pci_irq_get_affinity(dev, vector)
> 
> The original patch set from hch:
> https://lwn.net/Articles/693653/

Thanks for educating me. 
Saleem, Shiraz Oct. 24, 2018, 12:09 a.m. UTC | #17
On Tue, Oct 23, 2018 at 03:25:06PM -0600, Sagi Grimberg wrote:
> 
> >>>> Christoph, Sagi:  it seems you think /proc/irq/$IRP/smp_affinity
> >>>> shouldn't be allowed if drivers support managed affinity. Is that correct?
> >>>
> >>> Not just shouldn't, but simply can't.
> >>>
> >>>> But as it stands, things are just plain borked if an rdma driver
> >>>> supports ib_get_vector_affinity() yet the admin changes the affinity via
> >>>> /proc...
> >>>
> >>> I think we need to fix ib_get_vector_affinity to not return anything
> >>> if the device doesn't use managed irq affinity.
> >>
> >> Steve, does iw_cxgb4 use managed affinity?
> >>
> >> I'll send a patch for mlx5 to simply not return anything as managed
> >> affinity is not something that the maintainers want to do.
> > 
> > I'm beginning to think I don't know what "managed affinity" actually is.  Currently iw_cxgb4 doesn't support ib_get_vector_affinity().  I have a patch for it, but ran into this whole issue with nvme failing if someone changes the affinity map via /proc.
> 
> That means that the pci subsystem gets your vector(s) affinity right and
> immutable. It also guarantees that you have reserved vectors and not get
> a best effort assignment when cpu cores are offlined.
> 
> You can simply enable it by adding PCI_IRQ_AFFINITY to
> pci_alloc_irq_vectors() or call pci_alloc_irq_vectors_affinity()
> to communicate post/pre vectors that don't participate in
> affinitization (nvme uses it for admin queue).
> 
> This way you can easily plug ->get_vector_affinity() to return
> pci_irq_get_affinity(dev, vector)
> 
> The original patch set from hch:
> https://lwn.net/Articles/693653/

Sagi - From what I can tell, i40iw is also exposed to this same issue if the IRQ affinity
is configured by user.
Sagi Grimberg Oct. 24, 2018, 12:37 a.m. UTC | #18
> Sagi - From what I can tell, i40iw is also exposed to this same issue if the IRQ affinity
> is configured by user.

managed affinity does not allow setting smp_affinity from userspace.
Saleem, Shiraz Oct. 29, 2018, 11:58 p.m. UTC | #19
>> Sagi - From what I can tell, i40iw is also exposed to this same issue
>> if the IRQ affinity is configured by user.
>
>managed affinity does not allow setting smp_affinity from userspace.

OK. But our device IRQs are not set up to use managed affinity and can be tuned by the user.
And that seems a reasonable approach, since we can never tell ahead of time what the optimal config for a particular workload might be.

Shiraz
Sagi Grimberg Oct. 30, 2018, 6:26 p.m. UTC | #20
>>> Sagi - From what I can tell, i40iw is also exposed to this same issue
>>> if the IRQ affinity is configured by user.
>>
>> managed affinity does not allow setting smp_affinity from userspace.
> 
> OK. But our device IRQs are not set-up as to be managed affinity based and can be tuned by user.
> And it seems that is reasonable approach since we can never tell ahead of time what might be an optimal config. for a particular workload.

I understand, but it's not necessarily the case: the vast majority of
users don't touch irq affinity (or at least don't want to), and managed
affinity gives you the benefits of good linear assignment practice
(among others).

The same argument can be made for a number of pcie devices that do use
managed affinity, but the choice there was to get it right from the start.
Still not sure why (r)nics are different in that respect.

Anyway, I was just asking, not telling you to go and use it.

Patch

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 3eb169f15842..34811db8cba9 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -30,30 +30,35 @@  static int get_first_sibling(unsigned int cpu)
 	return cpu;
 }
 
-int blk_mq_map_queues(struct blk_mq_tag_set *set)
+void blk_mq_map_queue_cpu(struct blk_mq_tag_set *set, unsigned int cpu)
 {
 	unsigned int *map = set->mq_map;
 	unsigned int nr_queues = set->nr_hw_queues;
-	unsigned int cpu, first_sibling;
+	unsigned int first_sibling;
 
-	for_each_possible_cpu(cpu) {
-		/*
-		 * First do sequential mapping between CPUs and queues.
-		 * In case we still have CPUs to map, and we have some number of
-		 * threads per cores then map sibling threads to the same queue for
-		 * performace optimizations.
-		 */
-		if (cpu < nr_queues) {
+	/*
+	 * First do sequential mapping between CPUs and queues.
+	 * In case we still have CPUs to map, and we have some number of
+	 * threads per cores then map sibling threads to the same queue for
+	 * performace optimizations.
+	 */
+	if (cpu < nr_queues) {
+		map[cpu] = cpu_to_queue_index(nr_queues, cpu);
+	} else {
+		first_sibling = get_first_sibling(cpu);
+		if (first_sibling == cpu)
 			map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-		} else {
-			first_sibling = get_first_sibling(cpu);
-			if (first_sibling == cpu)
-				map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-			else
-				map[cpu] = map[first_sibling];
-		}
+		else
+			map[cpu] = map[first_sibling];
 	}
+}
+
+int blk_mq_map_queues(struct blk_mq_tag_set *set)
+{
+	unsigned int cpu;
 
+	for_each_possible_cpu(cpu)
+		blk_mq_map_queue_cpu(set, cpu);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(blk_mq_map_queues);
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f1de18..3bce60cd5bcf 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -14,6 +14,61 @@ 
 #include <linux/blk-mq-rdma.h>
 #include <rdma/ib_verbs.h>
 
+static int blk_mq_rdma_map_queue(struct blk_mq_tag_set *set,
+		struct ib_device *dev, int first_vec, unsigned int queue)
+{
+	const struct cpumask *mask;
+	unsigned int cpu;
+	bool mapped = false;
+
+	mask = ib_get_vector_affinity(dev, first_vec + queue);
+	if (!mask)
+		return -ENOTSUPP;
+
+	/* map with an unmapped cpu according to affinity mask */
+	for_each_cpu(cpu, mask) {
+		if (set->mq_map[cpu] == UINT_MAX) {
+			set->mq_map[cpu] = queue;
+			mapped = true;
+			break;
+		}
+	}
+
+	if (!mapped) {
+		int n;
+
+		/* map with an unmapped cpu in the same numa node */
+		for_each_node(n) {
+			const struct cpumask *node_cpumask = cpumask_of_node(n);
+
+			if (!cpumask_intersects(mask, node_cpumask))
+				continue;
+
+			for_each_cpu(cpu, node_cpumask) {
+				if (set->mq_map[cpu] == UINT_MAX) {
+					set->mq_map[cpu] = queue;
+					mapped = true;
+					break;
+				}
+			}
+		}
+	}
+
+	if (!mapped) {
+		/* map with any unmapped cpu we can find */
+		for_each_possible_cpu(cpu) {
+			if (set->mq_map[cpu] == UINT_MAX) {
+				set->mq_map[cpu] = queue;
+				mapped = true;
+				break;
+			}
+		}
+	}
+
+	WARN_ON_ONCE(!mapped);
+	return 0;
+}
+
 /**
  * blk_mq_rdma_map_queues - provide a default queue mapping for rdma device
  * @set:	tagset to provide the mapping for
@@ -21,31 +76,36 @@ 
  * @first_vec:	first interrupt vectors to use for queues (usually 0)
  *
  * This function assumes the rdma device @dev has at least as many available
- * interrupt vetors as @set has queues.  It will then query it's affinity mask
- * and built queue mapping that maps a queue to the CPUs that have irq affinity
- * for the corresponding vector.
+ * interrupt vetors as @set has queues.  It will then query vector affinity mask
+ * and attempt to build irq affinity aware queue mappings. If optimal affinity
+ * aware mapping cannot be acheived for a given queue, we look for any unmapped
+ * cpu to map it. Lastly, we map naively all other unmapped cpus in the mq_map.
  *
  * In case either the driver passed a @dev with less vectors than
  * @set->nr_hw_queues, or @dev does not provide an affinity mask for a
  * vector, we fallback to the naive mapping.
  */
 int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
-		struct ib_device *dev, int first_vec)
+                struct ib_device *dev, int first_vec)
 {
-	const struct cpumask *mask;
 	unsigned int queue, cpu;
 
+	/* reset cpu mapping */
+	for_each_possible_cpu(cpu)
+		set->mq_map[cpu] = UINT_MAX;
+
 	for (queue = 0; queue < set->nr_hw_queues; queue++) {
-		mask = ib_get_vector_affinity(dev, first_vec + queue);
-		if (!mask)
+		if (blk_mq_rdma_map_queue(set, dev, first_vec, queue))
 			goto fallback;
+	}
 
-		for_each_cpu(cpu, mask)
-			set->mq_map[cpu] = queue;
+	/* map any remaining unmapped cpus */
+	for_each_possible_cpu(cpu) {
+		if (set->mq_map[cpu] == UINT_MAX)
+			blk_mq_map_queue_cpu(set, cpu);
 	}
 
 	return 0;
-
 fallback:
 	return blk_mq_map_queues(set);
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d710e92874cc..6eb09c4de34f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -285,6 +285,7 @@  int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout);
 
 int blk_mq_map_queues(struct blk_mq_tag_set *set);
+void blk_mq_map_queue_cpu(struct blk_mq_tag_set *set, unsigned int cpu);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);