
[0/9] Support using MSI interrupts in ntb_transport

Message ID: 20190131185656.17972-1-logang@deltatee.com
Series: Support using MSI interrupts in ntb_transport

Message

Logan Gunthorpe Jan. 31, 2019, 6:56 p.m. UTC
Hi,

This patch series adds optional support for using MSI interrupts instead
of NTB doorbells in ntb_transport. This is desirable because doorbells on
current hardware are quite slow, so switching to MSI interrupts provides
a significant performance gain. On Switchtec hardware, a simple
apples-to-apples comparison shows ntb_netdev/iperf numbers going from
3.88 Gb/s to 14.1 Gb/s when switching to MSI interrupts.

To do this, a couple changes are required outside of the NTB tree:

1) The IOMMU must be able to accept MSI requests from aliased bus numbers,
because NTB hardware typically sends proxied requests using additional
requester IDs. The first patch in this series adds this support for the
Intel IOMMU. A quirk to add these aliases for Switchtec hardware was
already accepted. See commit ad281ecf1c7d ("PCI: Add DMA alias quirk
for Microsemi Switchtec NTB") for a description of NTB proxy IDs and why
this is necessary.

2) NTB transport (and other clients) may often need more MSI interrupts
than the NTB hardware actually advertises support for. However, because
these interrupts are triggered through an NTB memory window rather than
by the hardware itself, the hardware does not actually need to support
them or even know about them. Therefore we add the concept of virtual MSI
interrupts, which are allocated just like any other MSI interrupt but
are not programmed into the hardware's MSI table. This is done in
Patch 2 and then made use of in Patch 3; a minimal usage sketch follows
below.
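
For illustration, a client that wants more vectors than the device
exposes might request them roughly as follows. This is only a sketch:
the PCI_IRQ_VIRTUAL flag name is the one proposed in Patch 2, and the
exact semantics should be taken from that patch rather than from this
example.

#include <linux/pci.h>

/*
 * Sketch only: request up to 32 MSI-X vectors, allowing any vectors
 * beyond what the device's MSI-X table supports to be "virtual", i.e.
 * allocated by the core but never programmed into the hardware table.
 * PCI_IRQ_VIRTUAL is the flag name proposed in Patch 2 of this series.
 */
static int example_alloc_vectors(struct pci_dev *pdev)
{
        int nvec;

        nvec = pci_alloc_irq_vectors(pdev, 1, 32,
                                     PCI_IRQ_MSIX | PCI_IRQ_VIRTUAL);
        if (nvec < 0)
                return nvec;

        /*
         * Vectors past the hardware table size will only ever fire when
         * some other agent (e.g. an NTB peer) writes the MSI message to
         * the corresponding address.
         */
        return nvec;
}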

The remaining patches in this series add a library for dealing with MSI
interrupts, a test client and finally support in ntb_transport.
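
As a rough sketch of how a client might use the proposed MSI library
(function and structure names follow Patch 6, but treat the exact
signatures shown here as assumptions, not the final API):

#include <linux/interrupt.h>
#include <linux/ntb.h>

static irqreturn_t example_msi_isr(int irq, void *dev_id)
{
        /* The peer wrote our MSI message through its memory window. */
        return IRQ_HANDLED;
}

static int example_setup_msi(struct ntb_dev *ntb, void *ctx)
{
        struct ntb_msi_desc desc;
        int rc;

        /* Initialize MSI support for this NTB device. */
        rc = ntb_msi_init(ntb);
        if (rc)
                return rc;

        /* Configure the memory windows used to deliver MSIs. */
        rc = ntb_msi_setup_mws(ntb);
        if (rc)
                return rc;

        /*
         * Allocate one (possibly virtual) vector and learn the
         * address/data the peer must write to trigger it.
         */
        rc = ntbm_msi_request_irq(ntb, example_msi_isr, "example", ctx, &desc);
        if (rc < 0)
                return rc;

        /*
         * 'desc' would then be communicated to the peer (e.g. via
         * scratchpads) so it can use ntb_msi_peer_trigger() instead of
         * ringing a doorbell.
         */
        return 0;
}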

The series is based on v5.0-rc4, and I've tested it on top of some of
the patches I've already sent to the NTB tree (though they are
independent changes). A git repo is available here:

https://github.com/sbates130272/linux-p2pmem/ ntb_transport_msi_v1

Thanks,

Logan

--

Logan Gunthorpe (9):
  iommu/vt-d: Allow interrupts from the entire bus for aliased devices
  PCI/MSI: Support allocating virtual MSI interrupts
  PCI/switchtec: Add module parameter to request more interrupts
  NTB: Introduce functions to calculate multi-port resource index
  NTB: Rename ntb.c to support multiple source files in the module
  NTB: Introduce MSI library
  NTB: Introduce NTB MSI Test Client
  NTB: Add ntb_msi_test support to ntb_test
  NTB: Add MSI interrupt support to ntb_transport

 drivers/iommu/intel_irq_remapping.c     |  12 +
 drivers/ntb/Kconfig                     |  10 +
 drivers/ntb/Makefile                    |   3 +
 drivers/ntb/{ntb.c => core.c}           |   0
 drivers/ntb/msi.c                       | 313 ++++++++++++++++++
 drivers/ntb/ntb_transport.c             | 134 +++++++-
 drivers/ntb/test/Kconfig                |   9 +
 drivers/ntb/test/Makefile               |   1 +
 drivers/ntb/test/ntb_msi_test.c         | 416 ++++++++++++++++++++++++
 drivers/pci/msi.c                       |  51 ++-
 drivers/pci/switch/switchtec.c          |  12 +-
 include/linux/msi.h                     |   1 +
 include/linux/ntb.h                     | 139 ++++++++
 include/linux/pci.h                     |   9 +
 tools/testing/selftests/ntb/ntb_test.sh |  54 ++-
 15 files changed, 1150 insertions(+), 14 deletions(-)
 rename drivers/ntb/{ntb.c => core.c} (100%)
 create mode 100644 drivers/ntb/msi.c
 create mode 100644 drivers/ntb/test/ntb_msi_test.c

--
2.19.0

Comments

Dave Jiang Jan. 31, 2019, 8:20 p.m. UTC | #1
On 1/31/2019 11:56 AM, Logan Gunthorpe wrote:
> Hi,
>
> This patch series adds optional support for using MSI interrupts instead
> of NTB doorbells in ntb_transport. This is desirable because doorbells on
> current hardware are quite slow, so switching to MSI interrupts provides
> a significant performance gain. On Switchtec hardware, a simple
> apples-to-apples comparison shows ntb_netdev/iperf numbers going from
> 3.88 Gb/s to 14.1 Gb/s when switching to MSI interrupts.
>
> To do this, a couple changes are required outside of the NTB tree:
>
> 1) The IOMMU must be able to accept MSI requests from aliased bus numbers,
> because NTB hardware typically sends proxied requests using additional
> requester IDs. The first patch in this series adds this support for the
> Intel IOMMU. A quirk to add these aliases for Switchtec hardware was
> already accepted. See commit ad281ecf1c7d ("PCI: Add DMA alias quirk
> for Microsemi Switchtec NTB") for a description of NTB proxy IDs and why
> this is necessary.
>
> 2) NTB transport (and other clients) may often need more MSI interrupts
> than the NTB hardware actually advertises support for. However, because
> these interrupts are triggered through an NTB memory window rather than
> by the hardware itself, the hardware does not actually need to support
> them or even know about them. Therefore we add the concept of virtual MSI
> interrupts, which are allocated just like any other MSI interrupt but
> are not programmed into the hardware's MSI table. This is done in
> Patch 2 and then made use of in Patch 3.

Logan,

Does this work when the system moves the MSI vector either via software 
(irqbalance) or BIOS APIC programming (some modes cause round robin 
behavior)?



Logan Gunthorpe Jan. 31, 2019, 8:48 p.m. UTC | #2
On 2019-01-31 1:20 p.m., Dave Jiang wrote:
> Does this work when the system moves the MSI vector either via software 
> (irqbalance) or BIOS APIC programming (some modes cause round robin 
> behavior)?


I don't know how irqbalance works, and I'm not sure what you are
referring to by BIOS APIC programming; however, I would expect these
things would not be a problem.

The MSI code I'm presenting here doesn't do anything crazy with the
interrupts; it allocates and uses them just as any PCI driver would. The
only real difference here is that instead of a piece of hardware sending
the IRQ TLP, it will be sent through the memory window (which, from the
OS's perspective, is just coming from an NTB hardware proxy alias).
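
Conceptually (this is only an illustrative sketch; the names are not the
actual API from the series), triggering the peer's interrupt amounts to
writing the 32-bit MSI message data through the mapped memory window:

#include <linux/io.h>
#include <linux/types.h>

/*
 * Illustrative sketch only: the peer's MSI address is reachable through
 * our outbound memory window, so writing the MSI data there produces the
 * same MemWr TLP a local device would have generated.
 */
static void example_trigger_peer_msi(void __iomem *peer_mw,
                                     unsigned long msi_offset, u32 msi_data)
{
        iowrite32(msi_data, peer_mw + msi_offset);
}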

Logan
Dave Jiang Jan. 31, 2019, 8:58 p.m. UTC | #3
On 1/31/2019 1:48 PM, Logan Gunthorpe wrote:
>
> On 2019-01-31 1:20 p.m., Dave Jiang wrote:
>> Does this work when the system moves the MSI vector either via software
>> (irqbalance) or BIOS APIC programming (some modes cause round robin
>> behavior)?
>
> I don't know how irqbalance works, and I'm not sure what you are
> referring to by BIOS APIC programming, however I would expect these
> things would not be a problem.
>
> The MSI code I'm presenting here doesn't do anything crazy with the
> interrupts, it allocates and uses them just as any PCI driver would. The
> only real difference here is that instead of a piece of hardware sending
> the IRQ TLP, it will be sent through the memory window (which, from the
> OS's perspective, is just coming from an NTB hardware proxy alias).
>
> Logan
Right. I did that as a hack a while back for some silicon errata
workaround. When the vector moves, the address for the LAPIC changes. So
unless it gets updated, you end up writing to the old location and lose
all the new interrupts. irqbalance is a user daemon that rotates the
system interrupts around to ensure that not all interrupts are pinned on
a single core. I think it's enabled by default on several distros.
Although MSI-X has nothing to do with the IOAPIC, the mode the APIC is
programmed in can have an influence on how the interrupts are delivered.
There are certain Intel platforms (I don't know if AMD does anything
like that) that put the IOAPIC in a configuration that causes the
interrupts to be moved in a round-robin fashion. I think it's physical
flat mode? I don't quite recall. Normally on the low-end Xeons. It's
probably worth doing a test run with the irqbalance daemon running to
make sure your traffic stream doesn't all of a sudden stop.
Logan Gunthorpe Jan. 31, 2019, 10:39 p.m. UTC | #4
On 2019-01-31 1:58 p.m., Dave Jiang wrote:
> 
> On 1/31/2019 1:48 PM, Logan Gunthorpe wrote:
>>
>> On 2019-01-31 1:20 p.m., Dave Jiang wrote:
>>> Does this work when the system moves the MSI vector either via software
>>> (irqbalance) or BIOS APIC programming (some modes cause round robin
>>> behavior)?
>>
>> I don't know how irqbalance works, and I'm not sure what you are
>> referring to by BIOS APIC programming, however I would expect these
>> things would not be a problem.
>>
>> The MSI code I'm presenting here doesn't do anything crazy with the
>> interrupts, it allocates and uses them just as any PCI driver would. The
>> only real difference here is that instead of a piece of hardware sending
>> the IRQ TLP, it will be sent through the memory window (which, from the
>> OS's perspective, is just coming from an NTB hardware proxy alias).
>>
>> Logan
> Right. I did that as a hack a while back for some silicon errata 
> workaround. When the vector moves, the address for the LAPIC changes. So 
> unless it gets updated, you end up writing to the old location and lose 
> all the new interrupts. irqbalance is a user daemon that rotates the 
> system interrupts around to ensure that not all interrupts are pinned on 
> a single core. 

Yes, that would be a problem if something changes the MSI vectors out
from under us. Seems like that would be a bit difficult to do even with
regular hardware. So far I haven't seen anything that would do that. If
you know where in the kernel this happens, I'd be interested in
getting a pointer to the flow in the code. If that is the case, this MSI
stuff will need to get much more complicated...

> I think it's enabled by default on several distros. 

> Although MSI-X has nothing to do with the IOAPIC, the mode the APIC is
> programmed in can have an influence on how the interrupts are delivered.
> There are certain Intel platforms (I don't know if AMD does anything
> like that) that put the IOAPIC in a configuration that causes the
> interrupts to be moved in a round-robin fashion. I think it's physical
> flat mode? I don't quite recall. Normally on the low-end Xeons. It's
> probably worth doing a test run with the irqbalance daemon running to
> make sure your traffic stream doesn't all of a sudden stop.

I've tested with irqbalance running and haven't found any noticeable
difference.

Logan
Dave Jiang Jan. 31, 2019, 10:46 p.m. UTC | #5
On 1/31/2019 3:39 PM, Logan Gunthorpe wrote:
>
> On 2019-01-31 1:58 p.m., Dave Jiang wrote:
>> On 1/31/2019 1:48 PM, Logan Gunthorpe wrote:
>>> On 2019-01-31 1:20 p.m., Dave Jiang wrote:
>>>> Does this work when the system moves the MSI vector either via software
>>>> (irqbalance) or BIOS APIC programming (some modes cause round robin
>>>> behavior)?
>>> I don't know how irqbalance works, and I'm not sure what you are
>>> referring to by BIOS APIC programming, however I would expect these
>>> things would not be a problem.
>>>
>>> The MSI code I'm presenting here doesn't do anything crazy with the
>>> interrupts, it allocates and uses them just as any PCI driver would. The
>>> only real difference here is that instead of a piece of hardware sending
>>> the IRQ TLP, it will be sent through the memory window (which, from the
>>> OS's perspective, is just coming from an NTB hardware proxy alias).
>>>
>>> Logan
>> Right. I did that as a hack a while back for some silicon errata
>> workaround. When the vector moves, the address for the LAPIC changes. So
>> unless it gets updated, you end up writing to the old location and lose
>> all the new interrupts. irqbalance is a user daemon that rotates the
>> system interrupts around to ensure that not all interrupts are pinned on
>> a single core.
> Yes, that would be a problem if something changes the MSI vectors out
> from under us. Seems like that would be a bit difficult to do even with
> regular hardware. So far I haven't seen anything that would do that. If
> you know of where in the kernel this happens I'd be interested in
> getting a pointer to the flow in the code. If that is the case this MSI
> stuff will need to get much more complicated...

I believe irqbalance writes to the file /proc/irq/N/smp_affinity. So 
maybe take a look at the code that starts from there and see if it would 
have any impact on your stuff.


Logan Gunthorpe Jan. 31, 2019, 11:41 p.m. UTC | #6
On 2019-01-31 3:46 p.m., Dave Jiang wrote:
> I believe irqbalance writes to the file /proc/irq/N/smp_affinity. So 
> maybe take a look at the code that starts from there and see if it would 
> have any impact on your stuff.

Ok, well on my system I can write to the smp_affinity all day and the
MSI interrupts still work fine.

The MSI code is a bit difficult to trace and audit with all the
different chips and the parent chips which I don't have a good
understanding of. But I can definitely see that it could be possible for
some chips to change the address, as smp_affinity will eventually
sometimes call msi_domain_set_affinity(), which does seem to recompose
the message and write it back to the chip.

So, I could relatively easily add a callback to msi_desc to catch this
and resend the MSI address/data. However, I'm not sure how this is ever
done atomically. It seems like there would be a race while the device
updates its address where old interrupts could be triggered. This race
would be much longer for us when sending this information over the NTB
link. Though, I guess if the only change is that it encodes CPU
information in the address then that would not be an issue. However, I'm
not sure I can say that for certain without a comprehensive
understanding of all the IRQ chips.
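
Roughly, the sort of hook I have in mind would look like the sketch below
(example_publish_msi_msg() is a hypothetical helper that would forward
the new message to the peer, e.g. over scratchpads; the hook signature
itself is also just an assumption at this point):

#include <linux/msi.h>

/* Hypothetical helper: push the new MSI address/data to the peer. */
static void example_publish_msi_msg(void *peer_ctx, struct msi_msg *msg);

/*
 * Sketch of a per-descriptor callback, installed on each msi_desc we
 * hand to the peer, invoked whenever the core rewrites the MSI message
 * (e.g. on an affinity change) so the updated address/data can be
 * re-sent across the NTB link.
 */
static void example_msi_write_msg_cb(struct msi_desc *desc, void *data)
{
        example_publish_msi_msg(data, &desc->msg);
}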

Any thoughts on this?

Logan
Dave Jiang Jan. 31, 2019, 11:48 p.m. UTC | #7
On 1/31/2019 4:41 PM, Logan Gunthorpe wrote:
>
> On 2019-01-31 3:46 p.m., Dave Jiang wrote:
>> I believe irqbalance writes to the file /proc/irq/N/smp_affinity. So
>> maybe take a look at the code that starts from there and see if it would
>> have any impact on your stuff.
> Ok, well on my system I can write to the smp_affinity all day and the
> MSI interrupts still work fine.

Maybe your code is ok then. If the stats show up in /proc/interrupts 
then you can see it moving to different cores.

> The MSI code is a bit difficult to trace and audit with all the
> different chips and the parent chips which I don't have a good
> understanding of. But I can definitely see that it could be possible for
> some chips to change the address, as smp_affinity will eventually
> sometimes call msi_domain_set_affinity(), which does seem to recompose
> the message and write it back to the chip.
>
> So, I could relatively easily add a callback to msi_desc to catch this
> and resend the MSI address/data. However, I'm not sure how this is ever
> done atomically. It seems like there would be a race while the device
> updates its address where old interrupts could be triggered. This race
> would be much longer for us when sending this information over the NTB
> link. Though, I guess if the only change is that it encodes CPU
> information in the address then that would not be an issue. However, I'm
> not sure I can say that for certain without a comprehensive
> understanding of all the IRQ chips.
>
> Any thoughts on this?

Yeah, I'm not sure what to do about it either, as I'm not super familiar
with that area. Just making a note of what I encountered. And you
are right, the updated info has to go over the NTB for the other side to
write to the updated place, so there's a lot of latency involved.



Logan Gunthorpe Jan. 31, 2019, 11:52 p.m. UTC | #8
On 2019-01-31 4:48 p.m., Dave Jiang wrote:
> 
> On 1/31/2019 4:41 PM, Logan Gunthorpe wrote:
>>
>> On 2019-01-31 3:46 p.m., Dave Jiang wrote:
>>> I believe irqbalance writes to the file /proc/irq/N/smp_affinity. So
>>> maybe take a look at the code that starts from there and see if it would
>>> have any impact on your stuff.
>> Ok, well on my system I can write to the smp_affinity all day and the
>> MSI interrupts still work fine.
> 
> Maybe your code is ok then. If the stats show up in /proc/interrupts 
> then you can see it moving to different cores.

Yes, I did check that the stats change CPU in /proc/interrupts.

> Yeah I'm not sure what to do about it either as I'm not super familiar 
> with that area either. Just making note of what I encountered. And you 
> are right, the updated info has to go over NTB for the other side to 
> write to the updated place. So there's a lot of latency involved.

Ok, well I'll implement the callback anyway for v2. Better safe than
sorry. We can operate on the assumption that someone thought of the race
condition and if we ever see reports of lost interrupts we'll know where
to look.

Logan