Message ID: 20240126234237.547278-1-jacob.jun.pan@linux.intel.com
Series: Coalesced Interrupt Delivery with posted MSI
Hi Jacob,

I gave this a quick spin, using 4 gen2 optane drives. Basic test, just
IOPS bound on the drive, and using 1 thread per drive for IO. Random
reads, using io_uring.

For reference, using polled IO:

IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31

which is about 5.1M/drive, which is what they can deliver.

Before your patches, I see:

IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32

at 2.82M ints/sec. With the patches, I see:

IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32

at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
quite to the extent I expected. Booted with 'posted_msi' and I do see
posted interrupts increasing in the PMN line in /proc/interrupts.

Probably want to fold this one in:

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 8e09d40ea928..a289282f1cf9 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -393,7 +393,7 @@ void intel_posted_msi_init(void)
  * instead of:
  * read, xchg, read, xchg, read, xchg, read, xchg
  */
-static __always_inline inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
+static __always_inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
 {
 	int i, vec = FIRST_EXTERNAL_VECTOR;
 	unsigned long pir_copy[4];
Hi Jens,

On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> Hi Jacob,
>
> I gave this a quick spin, using 4 gen2 optane drives. Basic test, just
> IOPS bound on the drive, and using 1 thread per drive for IO. Random
> reads, using io_uring.
>
> For reference, using polled IO:
>
> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
>
> which is about 5.1M/drive, which is what they can deliver.
>
> Before your patches, I see:
>
> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
>
> at 2.82M ints/sec. With the patches, I see:
>
> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
>
> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
> quite to the extent I expected. Booted with 'posted_msi' and I do see
> posted interrupts increasing in the PMN line in /proc/interrupts.
>
The ints/sec reduction is not as high as I expected either, especially
at this high rate, which means not enough coalescing is going on to get
the performance benefits.

The opportunity for IRQ coalescing also depends on how long the
driver's hardirq handler executes. In the posted MSI demux loop, it does
not wait for more MSIs to arrive before exiting the pending-IRQ polling
loop. So if the hardirq handler finishes very quickly, it may not
coalesce as much. Perhaps we need to find more "useful" work to do to
maximize the window for coalescing.

I am not familiar with the optane driver; I need to look into how its
hardirq handler works. I have only tested NVMe gen5 in terms of storage
IO, where I saw a 30-50% ints/sec reduction at an even lower IRQ rate
(200k/sec).

> Probably want to fold this one in:
>
> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
> index 8e09d40ea928..a289282f1cf9 100644
> --- a/arch/x86/kernel/irq.c
> +++ b/arch/x86/kernel/irq.c
> @@ -393,7 +393,7 @@ void intel_posted_msi_init(void)
>   * instead of:
>   * read, xchg, read, xchg, read, xchg, read, xchg
>   */
> -static __always_inline inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
> +static __always_inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
>  {
>  	int i, vec = FIRST_EXTERNAL_VECTOR;
>  	unsigned long pir_copy[4];
>
Good catch! Will do.

Thanks,

Jacob
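To make the coalescing-window point above concrete, here is a rough sketch of
the kind of PIR demultiplexing loop being discussed. It reuses the signature
from the hunk quoted above, but the body only illustrates the behaviour
described in this thread and is not the actual patch; call_irq_handler() is a
stand-in for whatever per-vector dispatch helper the real code uses.

/*
 * Simplified illustration of posted-MSI demultiplexing -- not the patch
 * itself.  The 256-bit posted interrupt request (PIR) bitmap is snapshotted
 * with xchg() so new MSIs can keep posting while we dispatch, then every
 * vector found pending in the snapshot is handled.  Once a snapshot comes
 * back empty the caller simply returns: there is no waiting for
 * late-arriving MSIs, which bounds how much coalescing can happen when the
 * individual handlers finish quickly.
 */
static __always_inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
{
	unsigned long pir_copy[4];
	int i, vec = FIRST_EXTERNAL_VECTOR;
	bool handled = false;

	/* Atomically grab and clear each 64-bit chunk of the PIR. */
	for (i = 0; i < 4; i++)
		pir_copy[i] = xchg(&pir[i], 0);

	/* Dispatch every vector that was pending in the snapshot. */
	for_each_set_bit_from(vec, pir_copy, NR_VECTORS) {
		call_irq_handler(vec, regs);	/* hypothetical dispatch helper */
		handled = true;
	}

	return handled;
}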
On 2/9/24 10:43 AM, Jacob Pan wrote:
> Hi Jens,
>
> On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@kernel.dk> wrote:
>
>> Hi Jacob,
>>
>> I gave this a quick spin, using 4 gen2 optane drives. Basic test, just
>> IOPS bound on the drive, and using 1 thread per drive for IO. Random
>> reads, using io_uring.
>>
>> [...]
>>
>> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
>> quite to the extent I expected. Booted with 'posted_msi' and I do see
>> posted interrupts increasing in the PMN line in /proc/interrupts.
>>
> The ints/sec reduction is not as high as I expected either, especially
> at this high rate, which means not enough coalescing is going on to get
> the performance benefits.

Right, it means that we're getting pretty decent commands-per-int
coalescing already. I added another drive and repeated, here's that one:

IOPS w/polled: 25.7M IOPS

Stock kernel:

IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
IOPS=21.44M, BW=10.47GiB/s, IOS/call=32/32
IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32

at ~3.7M ints/sec, or about 5.8 IOPS / int on average.

Patched kernel:

IOPS=21.90M, BW=10.69GiB/s, IOS/call=31/32
IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/31
IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/32

at the same interrupt rate. So not a reduction, but slightly higher
perf. Maybe we're reaping more commands on average per interrupt.

Anyway, not a lot of interesting data there, just figured I'd re-run it
with the added drive.

> The opportunity for IRQ coalescing also depends on how long the
> driver's hardirq handler executes. In the posted MSI demux loop, it does
> not wait for more MSIs to arrive before exiting the pending-IRQ polling
> loop. So if the hardirq handler finishes very quickly, it may not
> coalesce as much. Perhaps we need to find more "useful" work to do to
> maximize the window for coalescing.
>
> I am not familiar with the optane driver; I need to look into how its
> hardirq handler works. I have only tested NVMe gen5 in terms of storage
> IO, where I saw a 30-50% ints/sec reduction at an even lower IRQ rate
> (200k/sec).

It's just an nvme device, so it's the nvme driver. The IRQ side is very
cheap - for as long as there are CQEs in the completion ring, it'll reap
them and complete them. That does mean that if we get an IRQ and there's
more than one entry to complete, we will do all of them. No IRQ
coalescing is configured (nvme kind of sucks for that...), but optane
media is much faster than flash, so that may be a difference.
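As an illustration of the "reap until empty" behaviour described above, here
is a driver-agnostic sketch of a phase-bit based completion-queue drain. It is
a simplification for illustration only, not the nvme driver's actual code, and
complete_command() is a hypothetical per-command callback.

/*
 * Sketch of the completion pattern: NVMe-style completion queues mark new
 * entries with a phase tag bit that flips on every wrap, so the handler
 * keeps consuming entries until the phase no longer matches.  One interrupt
 * can therefore retire many commands when completions arrive in bursts.
 */
#include <linux/types.h>

struct cqe {
	u16 cid;	/* command identifier */
	u16 status;	/* bit 0: phase tag (real CQEs carry more fields) */
};

/* Hypothetical per-command completion callback. */
void complete_command(u16 cid);

static int reap_completions(struct cqe *cq, u32 *head, u32 entries, u8 *phase)
{
	int reaped = 0;

	/* A new entry is valid when its phase tag matches the expected phase. */
	while ((cq[*head].status & 1) == *phase) {
		complete_command(cq[*head].cid);
		if (++(*head) == entries) {	/* wrap around and flip phase */
			*head = 0;
			*phase ^= 1;
		}
		reaped++;
	}

	/* A real driver would now write *head to the CQ head doorbell. */
	return reaped;
}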
Hi Jens,

On Fri, 9 Feb 2024 13:31:17 -0700, Jens Axboe <axboe@kernel.dk> wrote:
>
> [...]
>
> It's just an nvme device, so it's the nvme driver. The IRQ side is very
> cheap - for as long as there are CQEs in the completion ring, it'll reap
> them and complete them. That does mean that if we get an IRQ and there's
> more than one entry to complete, we will do all of them. No IRQ
> coalescing is configured (nvme kind of sucks for that...), but optane
> media is much faster than flash, so that may be a difference.
>
Yeah, I also checked the driver code; it seems to just wake up the
threaded handler.

For the record, here is my set up and performance data for 4 Samsung
disks.
IOPS increased from 1.6M per disk to 2.1M. One difference I noticed is
that IRQ throughput is improved instead of reduced with this patch on my
setup, e.g.:

BEFORE: 185545 ints/sec/vector
AFTER:  220128 ints/sec/vector

CPU (pinned to the highest non-turbo frequency, maybe different on yours):

echo "Set CPU frequency P1 2.7GHz"
for i in `seq 0 1 127`; do echo 2700000 > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_max_freq; done
for i in `seq 0 1 127`; do echo 2700000 > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_min_freq; done

PCI:

[root@emr-bkc posted_msi_tests]# lspci -vv -nn -s 0000:64:00.0 | grep -e Lnk -e Sam -e nvme
64:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM174X [144d:a826] (prog-if 02 [NVM Express])
	Subsystem: Samsung Electronics Co Ltd Device [144d:aa0a]
	LnkCap:	Port #0, Speed 32GT/s, Width x4, ASPM not supported
	LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
	LnkSta:	Speed 32GT/s (ok), Width x4 (ok)
	LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS
	LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis

NVMe setup:

nvme5n1	SAMSUNG MZWLO1T9HCJR-00A07
nvme6n1	SAMSUNG MZWLO1T9HCJR-00A07
nvme3n1	SAMSUNG MZWLO1T9HCJR-00A07
nvme4n1	SAMSUNG MZWLO1T9HCJR-00A07

FIO:

[global]
bs=4k
direct=1
norandommap
ioengine=libaio
randrepeat=0
readwrite=randread
group_reporting
time_based
iodepth=64
exitall
random_generator=tausworthe64
runtime=30
ramp_time=3
numjobs=8
group_reporting=1

#cpus_allowed_policy=shared
cpus_allowed_policy=split
[disk_nvme6n1_thread_1]
filename=/dev/nvme6n1
cpus_allowed=0-7
[disk_nvme6n1_thread_1]
filename=/dev/nvme5n1
cpus_allowed=8-15
[disk_nvme5n1_thread_2]
filename=/dev/nvme4n1
cpus_allowed=16-23
[disk_nvme5n1_thread_3]
filename=/dev/nvme3n1
cpus_allowed=24-31

iostat w/o posted MSI patch, v6.8-rc1:

nvme3c3n1	1615525.00	6462100.00	0.00	0.00	6462100
nvme4c4n1	1615471.00	6461884.00	0.00	0.00	6461884
nvme5c5n1	1615602.00	6462408.00	0.00	0.00	6462408
nvme6c6n1	1614637.00	6458544.00	0.00	0.00	6458544

irqtop (delta 1 sec.):

IRQ	TOTAL	DELTA	NAME
800	6290026	185545	IR-PCI-MSIX-0000:65:00.0 76-edge nvme5q76
797	6279554	185295	IR-PCI-MSIX-0000:65:00.0 73-edge nvme5q73
799	6281627	185200	IR-PCI-MSIX-0000:65:00.0 75-edge nvme5q75
802	6285742	185185	IR-PCI-MSIX-0000:65:00.0 78-edge nvme5q78
... similar irq rate for all 32 vectors

iostat w/ posted MSI patch:

Device		tps		kB_read/s	kB_wrtn/s	kB_dscd/s	kB_read	kB_wrtn	kB_dscd
nvme3c3n1	2184313.00	8737256.00	0.00		0.00		8737256	0	0
nvme4c4n1	2184241.00	8736972.00	0.00		0.00		8736972	0	0
nvme5c5n1	2184269.00	8737080.00	0.00		0.00		8737080	0	0
nvme6c6n1	2184003.00	8736012.00	0.00		0.00		8736012	0	0

irqtop w/ posted MSI patch:

IRQ	TOTAL		DELTA	NAME
PMN	5230078416	5502657	Posted MSI notification event
423	138068935	220128	IR-PCI-MSIX-0000:64:00.0 80-edge nvme4q80
425	138057654	219963	IR-PCI-MSIX-0000:64:00.0 82-edge nvme4q82
426	138101745	219890	IR-PCI-MSIX-0000:64:00.0 83-edge nvme4q83
... similar irq rate for all 32 vectors

IRQ coalescing ratio: posted interrupt notifications (PMN) / total MSIs
	= 5502657 / (220128 * 32) ≈ 0.78

Thanks,

Jacob
On 2/12/24 11:27 AM, Jacob Pan wrote:
> Hi Jens,
>
> On Fri, 9 Feb 2024 13:31:17 -0700, Jens Axboe <axboe@kernel.dk> wrote:
>
> [...]
>
>> It's just an nvme device, so it's the nvme driver. The IRQ side is very
>> cheap - for as long as there are CQEs in the completion ring, it'll reap
>> them and complete them. That does mean that if we get an IRQ and there's
>> more than one entry to complete, we will do all of them. No IRQ
>> coalescing is configured (nvme kind of sucks for that...), but optane
>> media is much faster than flash, so that may be a difference.
>>
> Yeah, I also checked the driver code; it seems to just wake up the
> threaded handler.
That only happens if you're using threaded interrupts, which is not the
default as it's much slower. What happens for the normal case is that we
init a batch, and then poll the CQ ring for completions, and then add
them to the completion batch. Once no more are found, we complete the
batch.

You're not using threaded interrupts, are you?

> For the record, here is my set up and performance data for 4 Samsung
> disks. IOPS increased from 1.6M per disk to 2.1M. One difference I
> noticed is that IRQ throughput is improved instead of reduced with this
> patch on my setup, e.g.:
>
> BEFORE: 185545 ints/sec/vector
> AFTER:  220128 ints/sec/vector

I'm surprised at the rates being that low, and, if so, that the posted
MSI makes a difference. Usually what I've seen for IRQ being slower than
poll is if interrupt delivery is unreasonably slow on that architecture
of machine. But ~200k/sec isn't that high at all.

> [global]
> bs=4k
> direct=1
> norandommap
> ioengine=libaio
> randrepeat=0
> readwrite=randread
> group_reporting
> time_based
> iodepth=64
> exitall
> random_generator=tausworthe64
> runtime=30
> ramp_time=3
> numjobs=8
> group_reporting=1
>
> #cpus_allowed_policy=shared
> cpus_allowed_policy=split
> [disk_nvme6n1_thread_1]
> filename=/dev/nvme6n1
> cpus_allowed=0-7
> [disk_nvme6n1_thread_1]
> filename=/dev/nvme5n1
> cpus_allowed=8-15
> [disk_nvme5n1_thread_2]
> filename=/dev/nvme4n1
> cpus_allowed=16-23
> [disk_nvme5n1_thread_3]
> filename=/dev/nvme3n1
> cpus_allowed=24-31

For better performance, I'd change that ioengine=libaio to:

ioengine=io_uring
fixedbufs=1
registerfiles=1

Particularly fixedbufs makes a big difference, as a big cycle consumer
is mapping/unmapping pages from the application space into the kernel
for O_DIRECT. With fixedbufs=1, this is done once and we just reuse the
buffers. At least for my runs, this is ~15% of the systime for doing IO.
It also removes the page referencing, which isn't as big a consumer, but
still noticeable.

Anyway, side quest, but I think you'll find this considerably reduces
overhead / improves performance. Also makes it so that you can compare
with polled IO on nvme, which aio can't do. You'd just add hipri=1 as an
option for that (with a side note that you need to configure nvme poll
queues, see the poll_queues parameter).

On my box, all the NVMe devices seem to be on node1, not node0 which
looks like it's the CPUs you are using. Might be worth checking and
adjusting your CPU domains for each drive? I also tend to get better
performance by removing the CPU scheduler, eg just pin each job to a
single CPU rather than many. It's just one process/thread anyway, so
really no point in giving it options here. It'll help reduce variability
too, which can be a pain in the butt to deal with.
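For concreteness, the delta to the job file quoted above might look roughly
like the sketch below. The option names are standard fio io_uring options,
but the exact combination here is an untested illustration of the suggestion,
not a verified job file.

[global]
# ... keep the rest of the global section unchanged ...
ioengine=io_uring
# register IO buffers once instead of mapping/unmapping them per IO
fixedbufs=1
# register the files with io_uring as well
registerfiles=1
# optional: polled completions instead of interrupts; requires the nvme
# driver to be loaded with poll_queues > 0
#hipri=1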
Hi Jens,

On Mon, 12 Feb 2024 11:36:42 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> On 2/12/24 11:27 AM, Jacob Pan wrote:
>
> [...]
>
> > Yeah, I also checked the driver code; it seems to just wake up the
> > threaded handler.
>
> That only happens if you're using threaded interrupts, which is not the
> default as it's much slower. What happens for the normal case is that we
> init a batch, and then poll the CQ ring for completions, and then add
> them to the completion batch. Once no more are found, we complete the
> batch.
>
Thanks for the explanation.

> You're not using threaded interrupts, are you?

No, I didn't add the module parameter "use_threaded_interrupts".

> [...]
>
> For better performance, I'd change that ioengine=libaio to:
>
> ioengine=io_uring
> fixedbufs=1
> registerfiles=1
>
> Particularly fixedbufs makes a big difference, as a big cycle consumer
> is mapping/unmapping pages from the application space into the kernel
> for O_DIRECT. With fixedbufs=1, this is done once and we just reuse the
> buffers. At least for my runs, this is ~15% of the systime for doing IO.
> It also removes the page referencing, which isn't as big a consumer, but
> still noticeable.
>
Indeed, the CPU utilization (system time) goes down significantly. I got
the following with the posted MSI patch applied:

Before (aio):
  read: IOPS=8925k, BW=34.0GiB/s (36.6GB/s)(1021GiB/30001msec)
  user    3m25.156s
  sys     11m16.785s

After (fixedbufs, io_uring engine):
  read: IOPS=8811k, BW=33.6GiB/s (36.1GB/s)(1008GiB/30002msec)
  user    2m56.255s
  sys     8m56.378s

It seems to have no gain in IOPS, just a reduction in CPU utilization.
Both are an improvement over libaio without the posted MSI patch.

> Anyway, side quest, but I think you'll find this considerably reduces
> overhead / improves performance. Also makes it so that you can compare
> with polled IO on nvme, which aio can't do. You'd just add hipri=1 as an
> option for that (with a side note that you need to configure nvme poll
> queues, see the poll_queues parameter).
>
> On my box, all the NVMe devices seem to be on node1, not node0 which
> looks like it's the CPUs you are using. Might be worth checking and
> adjusting your CPU domains for each drive? I also tend to get better
> performance by removing the CPU scheduler, eg just pin each job to a
> single CPU rather than many.
> It's just one process/thread anyway, so really no point in giving it
> options here. It'll help reduce variability too, which can be a pain in
> the butt to deal with.
>
Much faster with poll_queues=32 (32 jobs):

  read: IOPS=13.0M, BW=49.6GiB/s (53.3GB/s)(1489GiB/30001msec)
  user    2m29.177s
  sys     15m7.022s

Observed no IRQ counts from NVMe.

Thanks,

Jacob
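For reference, a minimal sketch of how the polled-IO run above might be set
up, assuming the nvme module can be reloaded on the test box (otherwise the
same value can go on the kernel command line as nvme.poll_queues=32):

# Give the nvme driver dedicated poll queues.
modprobe -r nvme
modprobe nvme poll_queues=32

# Sanity check that polling is enabled on one of the test drives.
cat /sys/block/nvme3n1/queue/io_poll

# Then run the fio jobs with ioengine=io_uring and hipri=1 so completions
# are polled rather than interrupt driven.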
Hi Jens,

On Mon, 12 Feb 2024 11:36:42 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> > For the record, here is my set up and performance data for 4 Samsung
> > disks. IOPS increased from 1.6M per disk to 2.1M. One difference I
> > noticed is that IRQ throughput is improved instead of reduced with this
> > patch on my setup, e.g.:
> >
> > BEFORE: 185545 ints/sec/vector
> > AFTER:  220128 ints/sec/vector
>
> I'm surprised at the rates being that low, and, if so, that the posted
> MSI makes a difference. Usually what I've seen for IRQ being slower than
> poll is if interrupt delivery is unreasonably slow on that architecture
> of machine. But ~200k/sec isn't that high at all.

Even at ~200k/sec, I am seeing a ratio of around 75% between posted
interrupt notifications and MSIs, i.e. for every 4 MSIs we save one CPU
notification. That might be where the savings come from. I was expecting
the same or fewer CPU notifications but more MSI throughput. Instead,
Optane gets fewer MSIs/sec, as your data shows.

Is it possible to get the interrupt coalescing ratio on your set up?
i.e. the PMN count in /proc/interrupts divided by the total NVMe MSIs.

Here is a summary of my testing on 4 Samsung Gen 5 drives:

test case               IOPS*1000    ints/sec (MSI)
=================================================
aio                     6348         182218
io_uring                6895         207932
aio w/ posted MSI       8295         185545
io_uring w/ posted MSI  8811         220128
io_uring poll_queue     13000        0
=================================================

Thanks,

Jacob
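A rough way to pull that ratio from a running system, assuming the posted-MSI
notification line is labelled "PMN" in /proc/interrupts (as in the irqtop
output earlier) and that all relevant device vectors have "nvme" in their
names; the counts are cumulative since boot, so sample twice during a run and
subtract for a per-run figure:

awk '/PMN/  { for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) pmn += $i }
     /nvme/ { for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) msi += $i }
     END    { if (msi) printf "PMN/MSI ratio = %.3f\n", pmn / msi }' /proc/interrupts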
On 1/27/2024 7:42 AM, Jacob Pan wrote:
> Hi Thomas and all,
>
> This patch set is aimed at improving IRQ throughput on Intel Xeon by
> making use of posted interrupts.
>
> There is a session at LPC2023 IOMMU/VFIO/PCI MC where I have presented
> this topic.
>
> https://lpc.events/event/17/sessions/172/#20231115
>
> Background
> ==========
> On modern x86 server SoCs, interrupt remapping (IR) is required and
> turned on by default to support X2APIC. Two interrupt remapping modes
> can be supported by IOMMU/VT-d:
>
> - Remappable (host)
> - Posted (guest only so far)
>
> With remappable mode, the device MSI to CPU process is a HW flow without
> system software touch points, it roughly goes as follows:
>
> 1. Devices issue interrupt requests with writes to 0xFEEx_xxxx
> 2. The system agent accepts and remaps/translates the IRQ
> 3. Upon receiving the translation response, the system agent notifies
>    the destination CPU with the translated MSI
> 4. CPU's local APIC accepts interrupts into its IRR/ISR registers
> 5. Interrupt delivered through IDT (MSI vector)
>
> The above process can be inefficient under high IRQ rates. The
> notifications in step #3 are often unnecessary when the destination CPU
> is already overwhelmed with handling bursts of IRQs. On some
> architectures, such as Intel Xeon, step #3 is also expensive and
> requires strong ordering w.r.t. DMA.

Can you tell more about this "step #3 requires strong ordering w.r.t. DMA"?

> As a result, slower IRQ rates can become a limiting factor for DMA I/O
> performance.
Hi Robert,

On Thu, 4 Apr 2024 21:45:05 +0800, Robert Hoo <robert.hoo.linux@gmail.com> wrote:
> On 1/27/2024 7:42 AM, Jacob Pan wrote:
>
> [...]
>
> > The above process can be inefficient under high IRQ rates. The
> > notifications in step #3 are often unnecessary when the destination CPU
> > is already overwhelmed with handling bursts of IRQs. On some
> > architectures, such as Intel Xeon, step #3 is also expensive and
> > requires strong ordering w.r.t. DMA.
>
> Can you tell more about this "step #3 requires strong ordering w.r.t.
> DMA"?
>
I am not sure how much microarchitecture detail I can disclose, but the
point is that there are ordering rules related to DMA reads/writes and
posted MSI writes. I am not a hardware expert.

From the PCIe point of view, my understanding is that the upstream writes
tested here on NVMe drives, as the result of 4K random reads, are relaxed
ordered. I can see lspci showing RlxdOrd+ on my Samsung drives:

DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
	RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
	MaxPayload 512 bytes, MaxReadReq 4096 bytes

But MSIs are strictly ordered afaik.

> > As a result, slower IRQ rates can become a limiting factor for DMA I/O
> > performance.

Thanks,

Jacob
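For anyone wanting to double-check the relaxed-ordering setting mentioned
above, a read-only sketch: the BDF is the one from the earlier lspci output,
and the setpci offset assumes the standard PCIe capability layout, where
Device Control sits at +0x08 within the PCI Express capability and bit 4 is
"Enable Relaxed Ordering".

# Human-readable view of the Device Control bits, including RlxdOrd.
lspci -vv -s 0000:64:00.0 | grep -e DevCtl -e RlxdOrd

# Raw Device Control register; bit 4 set means relaxed ordering is enabled.
setpci -s 0000:64:00.0 CAP_EXP+0x08.w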