
[V3,00/10] mpt3sas: Aero/Sea HBA feature addition

Message ID 20190531121443.30694-1-suganath-prabu.subramani@broadcom.com

Message

Suganath Prabu S May 31, 2019, 12:14 p.m. UTC
This patch series adds the following two features for Aero/Sea HBA
devices. The Aero/Sea series HBAs are PCIe 4.0 based controllers.

1. Add Atomic Request Descriptor support:
The Atomic Request Descriptor is an alternative method for posting
an entry onto a request queue. Posting an Atomic Request Descriptor
is an atomic operation (sketched below).
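
To illustrate the difference: the legacy path posts a 64-bit request
descriptor (typically two 32-bit writes serialized by a lock), while the
atomic path posts a single 32-bit descriptor in one write. Below is a
minimal sketch of the idea; the structure layout, register argument and
helper name are illustrative assumptions, not the exact mpt3sas code.

/*
 * Minimal sketch only: structure layout, register argument and helper
 * name are assumptions for illustration, not the exact mpt3sas code.
 */
#include <linux/io.h>
#include <linux/string.h>
#include <linux/types.h>

struct atomic_request_descriptor {	/* 32 bits in total */
	u8  request_flags;		/* descriptor type */
	u8  msix_index;			/* reply queue to complete on */
	u16 smid;			/* system message id of the request frame */
};

static void put_smid_atomic(void __iomem *atomic_post_reg,
			    u8 request_flags, u8 msix_index, u16 smid)
{
	struct atomic_request_descriptor d = {
		.request_flags = request_flags,
		.msix_index    = msix_index,
		.smid          = smid,
	};
	u32 val;

	memcpy(&val, &d, sizeof(val));
	/* A single 32-bit MMIO write posts the request; no lock is needed. */
	writel(val, atomic_post_reg);
}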

2. Max performance using balanced performance mode
(~3.0 million IOPS for random read/write):

Aero Hardware interrupt coalescing behavior:

H/W interrupt coalescing is effective within a group of reply queues.
Each reply queue group consists of 8 reply queues. This h/w interface
allows configuring a mixture of interrupt coalescing on/off per
controller. In 128-vector mode, every 8 MSI-X vectors are grouped
together. If a reply is sent to an MSI-X vector outside the present
group, the h/w flushes all outstanding IOs.

Below are some examples to illustrate the h/w interrupt coalescing
behavior, using a simple IO submission and completion flow.
Reply queues start at index 0, and replies return from the back-end
device in the same order as they were submitted (just to keep the
example simple). Reply queues 0-7 form one interrupt coalescing group
and reply queues 8-15 form another. Assume CoalescingTimeout is set to
its maximum value just to illustrate the h/w behavior:
CoalescingDepth = 0x4 and CoalescingTimeout = 0xFFFF.

Coalescing timeout:
Reply coalescing timeout in units of one-half of a microsecond
(500 nanoseconds). When the amount of time from the first pending
reply exceeds this value, the IOC sends the replies to the host
using a single interrupt.

Coalescing Depth:
This field specifies the reply coalescing depth. When the
number of pending replies exceeds this number, the IOC sends the
replies to the host using a single interrupt.

1. The driver has submitted 4 IOs to firmware targeted at reply queue
groups 4, 5, 6, and 7. In this case, the OS will receive 4 interrupts
for 4 replies. The interrupts on reply queue groups 4, 5, and 6 occur
because the next IO completion is on a different MSI-X vector group.
The interrupt on reply queue group 7 occurs due to the coalescing
timeout criteria (as there are no IOs after it).

2. The driver has submitted 4 IOs to firmware targeted at reply queue
groups 4, 4, 6, and 6. In this case, the OS will receive 2 interrupts
for 4 replies. The first interrupt, on reply queue group 4, occurs
before the coalescing timeout elapses because the next IO completion is
on a different MSI-X vector group. The interrupt on reply queue group 6
occurs after the coalescing timeout elapses.

3. The driver has submitted 6 IOs to firmware targeted at reply queue
groups 4, 4, 4, 4, 4 and 5. In this case, the OS will receive 3
interrupts for 6 replies. The first interrupt on reply queue group 4
occurs before the coalescing timeout elapses because the coalescing
depth criterion is met. The second interrupt on reply queue group 4
occurs before the coalescing timeout elapses, since the next IO
completion is on a different MSI-X vector group. The third interrupt,
on reply queue group 5, occurs after the coalescing timeout elapses.

Balanced performance interface for Aero/Sea:

The driver will use a combination of interrupt coalescing and no
interrupt coalescing in h/w. The driver will create a few high iops
queues to get high IOPS on Aero/Sea family controllers.
Low latency queues are queues without interrupt coalescing.
High iops queues are special queues with interrupt coalescing ON
(e.g. 0x4 coalescing depth and 0x20 coalescing timeout; a rough sketch
of programming these values follows below). Together, the high iops
queues and low latency queues should fit within the 128 reply queues
(the maximum supported by Aero/Sea).
The driver should pick only a few high iops queues, so that the low
latency IO path is changed as little as possible, while still having
enough high iops queues to reach the high IOPS numbers.
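
The coalescing depth and timeout above live in IOC Page 1. A rough
sketch of programming them follows; the helper names and flag bit
mirror the existing mpt3sas config-page helpers but are assumptions
here, not verified signatures.

/*
 * Rough sketch (assumes the mpt3sas driver headers): helper names and
 * the flag bit mirror the mpt3sas config-page helpers but are
 * assumptions here, not verified signatures.
 */
static int set_reply_coalescing(struct MPT3SAS_ADAPTER *ioc,
				u8 depth, u32 timeout_half_us)
{
	Mpi2ConfigReply_t mpi_reply;
	Mpi2IOCPage1_t ioc_pg1;
	int rc;

	rc = mpt3sas_config_get_ioc_pg1(ioc, &mpi_reply, &ioc_pg1);
	if (rc)
		return rc;

	/* e.g. depth = 0x4, timeout = 0x20 units (0x20 * 500 ns = 16 us) */
	ioc_pg1.CoalescingDepth = depth;
	ioc_pg1.CoalescingTimeout = cpu_to_le32(timeout_half_us);
	ioc_pg1.Flags |= cpu_to_le32(MPI2_IOCPAGE1_REPLY_COALESCING);

	return mpt3sas_config_set_ioc_pg1(ioc, &mpi_reply, &ioc_pg1);
}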

High level design for creating the high iops queues:
If there are unused reply queues left in the system because the logical
CPU count is less than the number of MSI-X vectors supported by
firmware, the driver should increase the effective reply queue count.

Example: a 2-socket server with 72 logical CPUs in total.
In that system, 56 reply queues would otherwise be left unmapped.

The driver should map reply queues 8 to 79 as low latency reply queues
and reply queues 0 to 7 as high iops reply queues.

If the number of online CPUs is greater than or equal to 120 (the total
MSI-X vectors supported by firmware minus the number of high iops
queues), allocate 128 reply queues: the driver should map reply queues
8 to 127 as low latency and reply queues 0 to 7 as high iops.
If the driver is not able to allocate the required reply queues, it
should fall back to legacy mode (without high iops queues). A minimal
sketch of this sizing logic is shown below.
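
A minimal sketch of the sizing decision, assuming illustrative names
and constants (not the driver's actual identifiers):

/*
 * Minimal sketch of the reply queue sizing described above. Names and
 * constants are illustrative assumptions, not the driver's identifiers.
 */
#include <linux/types.h>

#define MAX_MSIX_VECTORS	128	/* max reply queues on Aero/Sea */
#define HIGH_IOPS_QUEUES	8	/* reply queues 0-7 */

static int compute_reply_queue_count(int online_cpus, bool *high_iops)
{
	int count;

	*high_iops = true;
	if (online_cpus >= MAX_MSIX_VECTORS - HIGH_IOPS_QUEUES)
		count = MAX_MSIX_VECTORS;		/* e.g. >= 120 CPUs: all 128 queues */
	else
		count = HIGH_IOPS_QUEUES + online_cpus;	/* e.g. 72 CPUs -> 80 queues */

	/*
	 * If allocating 'count' MSI-X vectors fails later, the driver falls
	 * back to legacy mode: *high_iops = false and one reply queue per
	 * online CPU.
	 */
	return count;
}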

IO submission (applicable only if high iops queues are enabled):

If the number of outstanding IOs on a SCSI device is less than or equal
to 8, use the legacy IO path (i.e. the low latency reply queues).
If the number of outstanding IOs on a SCSI device is more than 8, the
driver should submit IOs round robin across the high iops queues in
batches.

Example: batches of 16.
"The first 16 IOs are submitted to reply queue 0, the next 16 IOs to
reply queue 1, and so on."
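
A minimal sketch of this per-IO reply queue (MSI-X index) selection,
with illustrative structure and field names; the outstanding-IO counter
is assumed from the 5.3-era SCSI midlayer and the per-device private
structure is hypothetical.

/*
 * Minimal sketch of per-IO reply queue selection. Structure, field and
 * threshold names are illustrative assumptions, not the driver's
 * identifiers.
 */
#include <linux/atomic.h>
#include <scsi/scsi_device.h>

#define HIGH_IOPS_QUEUES	8
#define HIGH_IOPS_DEVICE_DEPTH	8	/* outstanding IO threshold */
#define HIGH_IOPS_BATCH_SIZE	16

struct my_sdev_priv {			/* hypothetical per-device data */
	atomic64_t high_iops_submissions;
};

static u8 select_msix_index(struct scsi_device *sdev, u8 low_latency_index)
{
	struct my_sdev_priv *priv = sdev->hostdata;
	u64 n;

	/* Few outstanding IOs: stay on the latency-optimized (legacy) path. */
	if (atomic_read(&sdev->device_busy) <= HIGH_IOPS_DEVICE_DEPTH)
		return low_latency_index;

	/* Otherwise round robin across high iops queues 0-7 in batches of 16. */
	n = atomic64_inc_return(&priv->high_iops_submissions);
	return (n / HIGH_IOPS_BATCH_SIZE) % HIGH_IOPS_QUEUES;
}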

v2:
- Fix sparse warnings reported by kbuild test robot

v3:
- In patch 6, use an if statement instead of a ternary operator for
readability.

Suganath Prabu S (10):
  mpt3sas: function pointers of request descriptor
  mpt3sas: Add Atomic RequestDescriptor support on Aero
  mpt3sas: Add flag high_iops_queues
  mpt3sas: change _base_get_msix_index prototype
  mpt3sas: Use highiops queues if more in-flights
  mpt3sas:save msix index and use same while posting RD
  mpt3sas: Affinity high iops queues IRQs to local node
  mpt3sas: Enable interrupt coalescing on high iops
  mpt3sas: Introduce perf_mode module parameter
  mpt3sas: Update driver version to 29.100.00.00

 drivers/scsi/mpt3sas/mpi/mpi2_cnfg.h     |   2 +-
 drivers/scsi/mpt3sas/mpt3sas_base.c      | 476 ++++++++++++++++++++---
 drivers/scsi/mpt3sas/mpt3sas_base.h      |  34 +-
 drivers/scsi/mpt3sas/mpt3sas_config.c    |  73 +++-
 drivers/scsi/mpt3sas/mpt3sas_ctl.c       |  20 +-
 drivers/scsi/mpt3sas/mpt3sas_scsih.c     |  38 +-
 drivers/scsi/mpt3sas/mpt3sas_transport.c |   8 +-
 7 files changed, 574 insertions(+), 77 deletions(-)

Comments

Martin K. Petersen June 7, 2019, 2:30 p.m. UTC | #1
Suganath,

I applied this series to 5.3/scsi-queue.

However, I remain unconvinced of the merits of the config page
putback. Why even bother if a controller reset causes the defaults to be
loaded from NVRAM?

Also, triggering on X86 for selecting performance mode seems
questionable. I would like to see a follow-on patch that comes up with a
better heuristic.
Kashyap Desai June 7, 2019, 3:40 p.m. UTC | #2
>
> Suganath,
>
> I applied this series to 5.3/scsi-queue.
>
> However, I remain unconvinced of the merits of the config page putback.
> Why even bother if a controller reset causes the defaults to be loaded
> from NVRAM?
>
> Also, triggering on X86 for selecting performance mode seems
> questionable. I would like to see a follow-on patch that comes up with
> a better heuristic.

Martin -

AMD EPYC is not efficient w.r.t. QPI transactions. I tested performance
on an AMD EPYC 7601 chipset, which has 128 logical CPUs in total.
The Aero/Sea controller supports at most 128 MSI-X vectors, so in the
good case we have a 1:1 CPU to MSI-X mapping. I can get 2.4M IOPS in
this case.

Just to simulate a performance issue, I reduced the controller MSI-X
vectors to 64, which makes the CPU to MSI-X mapping 2:1. Indirectly, I
am trying to generate completions that have to be completed on a remote
CPU (via call_function_single_interrupt). In this case, I can get 1.7M
IOPS.

The same test on Intel architecture gives a better result (negligible
performance impact). This patch set maps the high iops queues (queues
with interrupt coalescing turned on) to the local NUMA node. The high
iops queue count is limited, and it depends upon QPI for IO completion.
We have enabled this feature only for Intel arch, where we have seen an
improvement. Not having this feature is not bad, but if we enable it we
may see a negative impact where QPI overhead is high (as on AMD).

Kashyap

>
> --
> Martin K. Petersen	Oracle Linux Engineering
Martin K. Petersen June 7, 2019, 4:49 p.m. UTC | #3
Kashyap,

> AMD EPYC is not efficient w.r.t QPI transaction.
[...]
> Same test on Intel architecture provides better result

Heuristics are always hard.

However, you are making assumptions based on observed performance of
current Intel offerings vs. current AMD offerings. This results in what
is inevitably going to be a short-lived heuristic in the kernel. Things
could easily be reversed in next generation platforms from these
vendors.

So while I appreciate that the logic works given the machines you are
currently testing, I think CPU manufacturer is a horrible heuristic. You
are stating "This will be the right choice for all future processors
manufactured by Intel". That's a bit of a leap of faith.

Instead of predicting the future I prefer to make decisions based on
things we know. Measured negative impact on current EPYC family, for
instance. That's a fairly well-defined and narrow scope.

That said, I am still not a big fan of platform-specific tweaks in
drivers. While I prefer the kernel to do the right thing out of the box,
I think the module parameter is probably the better choice in this case.
Kashyap Desai June 18, 2019, 4:45 p.m. UTC | #4
On Fri, Jun 7, 2019 at 10:19 PM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
>
> Kashyap,
>
> > AMD EPYC is not efficient w.r.t QPI transaction.
> [...]
> > Same test on Intel architecture provides better result
>
> Heuristics are always hard.
>
> However, you are making assumptions based on observed performance of
> current Intel offerings vs. current AMD offerings. This results in what
> is inevitably going to be a short-lived heuristic in the kernel. Things
> could easily be reversed in next generation platforms from these
> vendors.
>
> So while I appreciate that the logic works given the machines you are
> currently testing, I think CPU manufacturer is a horrible heuristic. You
> are stating "This will be the right choice for all future processors
> manufactured by Intel". That's a bit of a leap of faith.
>
> Instead of predicting the future I prefer to make decisions based on
> things we know. Measured negative impact on current EPYC family, for
> instance. That's a fairly well-defined and narrow scope.
>
> That said, I am still not a big fan of platform-specific tweaks in
> drivers. While I prefer the kernel to do the right thing out of the box,
> I think the module parameter is probably the better choice in this case.

Martin,
If we decide to remove the CPU arch check later, it will be
unnecessarily complex to explain the default driver behavior, as we may
end up with two driver behaviors.
We are going to remove the CPU architecture detection logic; it is
better to have the module parameter based dependency from day one.
We will be sending the relevant patch soon.

Kashyap


>
> --
> Martin K. Petersen      Oracle Linux Engineering