[v2,0/6] RFC: gup+dma: tracking dma-pinned pages

Message ID: 20181110085041.10071-1-jhubbard@nvidia.com

Message

john.hubbard@gmail.com Nov. 10, 2018, 8:50 a.m. UTC
From: John Hubbard <jhubbard@nvidia.com>

Hi, here is fodder for conversation during LPC.

Changes since v1:

a) Uses a simpler set/clear/test bit approach in the page flags.

b) Added some initial performance results in the cover letter here, below.

c) Rebased to latest linux.git

d) Puts pages back on the LRU when it's done with them.

Here is an updated proposal for tracking pages that have had
get_user_pages*() called on them. This is in support of fixing the problem
discussed in [1]. This RFC only shows how to set up a reliable
PageDmaPinned flag. What to *do* with that flag is left for a later
discussion.
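
As a purely hypothetical illustration (not part of this series), the kind of
consumer that later patches could build on top of a reliable PageDmaPinned
flag might look like the sketch below. Only the PageDmaPinned() name comes
from this series; the helper and its policy are invented here:

/*
 * Hypothetical sketch only: one possible later use of the flag, e.g. a
 * path that must not treat a DMA-pinned page like an ordinary page.
 * The actual policy (skip, defer, bounce-buffer, ...) is exactly the
 * "later discussion" mentioned above.
 */
static bool example_page_is_safe_to_write_back(struct page *page)
{
        if (PageDmaPinned(page))
                return false;   /* DMA may still be targeting this page */
        return true;
}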

The sequence would be:

    -- apply patches 1-2,

    -- convert the rest of the subsystems to call put_user_page*(). Patch #3
       is an example of that. It converts infiniband.

    -- apply patches 4-6,

    -- apply more patches, to actually use the new PageDmaPinned flag.

One question up front was, "how do we ensure that either put_user_page()
or put_page() is called, depending on whether the page came from
get_user_pages() or not?" From this series, you can see that:

    -- It's possible to assert within put_page() that we are probably in the
       right place. In practice, this assertion definitely helps.

    -- In the other direction, if put_user_page() is called when put_page()
       should have been used instead, then a "clear" report of LRU list
       corruption shows up reliably, because the first thing put_user_page()
       attempts is to decrement the lru's list.prev pointer, and so you'll
       see pointer values that are one less than an aligned pointer value.
       This is not great, but it's usable. So I think the conversion
       will turn into an exercise in getting code coverage, and that
       should work out.
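
To make the intended pairing concrete, here is a minimal sketch of the new
calling pattern. It is modeled loosely on what the infiniband conversion in
patch #3 does, but the function name and the omitted DMA-mapping details are
invented for illustration:

/*
 * Illustrative sketch only: pages obtained via get_user_pages*() are
 * released with put_user_page*() instead of put_page(), so that the
 * DMA-pinned state recorded at gup time can be undone at release time.
 */
static int example_pin_and_dma(unsigned long start, int nr_pages,
                               struct page **pages)
{
        int i, got;

        got = get_user_pages_fast(start, nr_pages, 1 /* write */, pages);
        if (got <= 0)
                return got;

        /* ... dma_map the pages and run the transfer ... */

        /* Previously this loop would have called put_page(pages[i]). */
        for (i = 0; i < got; i++)
                put_user_page(pages[i]);

        return got;
}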

I have lots of other patches, not shown here, in various stages of polish, to
convert enough of the kernel in order to run fio [2]. I did that, and got some
rather noisy performance results.

Here they are:

Performance notes:

1. These fio results are noisy. The standard deviation is large enough
that some of this could be noise. To highlight that, I did 5 runs each
with, and without, the patch, and while there is definitely a performance
drop on average, it's also true that there is overlap in the results. In
other words, the best "with patch" run is about the same as the worst
"without patch" run.

2. Initial profiling shows that we're adding about 1% total to this
particular test run...I think. I may have to narrow this down some more,
but I don't seem to see any real lock contention. Hints or ideas on
measurement methods are welcome, btw.

    -- 0.59% in put_user_page
    -- 0.2% (or 0.54%, depending on how you read the perf output below) in
       get_user_pages_fast:


          1.36%--iov_iter_get_pages
                    |
                     --1.27%--get_user_pages_fast
                               |
                                --0.54%--pin_page_for_dma

          0.59%--put_user_page

          1.34%     0.03%  fio   [kernel.vmlinux]     [k] _raw_spin_lock
          0.95%     0.55%  fio   [kernel.vmlinux]     [k] do_raw_spin_lock
          0.17%     0.03%  fio   [kernel.vmlinux]     [k] isolate_lru_page
          0.06%     0.00%  fio   [kernel.vmlinux]     [k] putback_lru_page
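
For reference, numbers of this kind can be collected with something along
these lines (just one possible recipe, not necessarily the exact commands
used for the runs above):

$ perf record -a -g -- fio ./fio.conf
$ perf report --stdio --children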

3. Here's some raw fio data: one run without the patch, and two with the patch:

------------------------------------------------------
WITHOUT the patch:
------------------------------------------------------
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=55.5MiB/s,w=0KiB/s][r=14.2k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=1750: Tue Nov  6 20:18:06 2018
   read: IOPS=13.9k, BW=54.4MiB/s (57.0MB/s)(1024MiB/18826msec)
    slat (usec): min=25, max=4870, avg=68.19, stdev=85.21
    clat (usec): min=74, max=19814, avg=4525.40, stdev=1844.03
     lat (usec): min=183, max=19927, avg=4593.69, stdev=1866.65
    clat percentiles (usec):
     |  1.00th=[ 3687],  5.00th=[ 3720], 10.00th=[ 3720], 20.00th=[ 3752],
     | 30.00th=[ 3752], 40.00th=[ 3752], 50.00th=[ 3752], 60.00th=[ 3785],
     | 70.00th=[ 4178], 80.00th=[ 4490], 90.00th=[ 6652], 95.00th=[ 8225],
     | 99.00th=[13173], 99.50th=[14353], 99.90th=[16581], 99.95th=[17171],
     | 99.99th=[18220]
   bw (  KiB/s): min=49920, max=59320, per=100.00%, avg=55742.24, stdev=2224.20, samples=37
   iops        : min=12480, max=14830, avg=13935.35, stdev=556.05, samples=37
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=68.78%, 10=28.14%, 20=3.08%
  cpu          : usr=2.39%, sys=95.30%, ctx=669, majf=0, minf=72
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=54.4MiB/s (57.0MB/s), 54.4MiB/s-54.4MiB/s (57.0MB/s-57.0MB/s), io=1024MiB (1074MB), run=18826-18826msec

Disk stats (read/write):
  nvme0n1: ios=259490/1, merge=0/0, ticks=14822/0, in_queue=19241, util=100.00%

------------------------------------------------------
With the patch applied:
------------------------------------------------------
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=51.2MiB/s,w=0KiB/s][r=13.1k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=2568: Tue Nov  6 20:03:50 2018
   read: IOPS=12.8k, BW=50.1MiB/s (52.5MB/s)(1024MiB/20453msec)
    slat (usec): min=33, max=4365, avg=74.05, stdev=85.79
    clat (usec): min=39, max=19818, avg=4916.61, stdev=1961.79
     lat (usec): min=100, max=20002, avg=4990.78, stdev=1985.23
    clat percentiles (usec):
     |  1.00th=[ 4047],  5.00th=[ 4080], 10.00th=[ 4080], 20.00th=[ 4080],
     | 30.00th=[ 4113], 40.00th=[ 4113], 50.00th=[ 4113], 60.00th=[ 4146],
     | 70.00th=[ 4178], 80.00th=[ 4817], 90.00th=[ 7308], 95.00th=[ 8717],
     | 99.00th=[14091], 99.50th=[15270], 99.90th=[17433], 99.95th=[18220],
     | 99.99th=[19006]
   bw (  KiB/s): min=45370, max=55784, per=100.00%, avg=51332.33, stdev=1843.77, samples=40
   iops        : min=11342, max=13946, avg=12832.83, stdev=460.92, samples=40
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=96.44%, 20=3.53%
  cpu          : usr=2.91%, sys=95.18%, ctx=398, majf=0, minf=73
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=50.1MiB/s (52.5MB/s), 50.1MiB/s-50.1MiB/s (52.5MB/s-52.5MB/s), io=1024MiB (1074MB), run=20453-20453msec

Disk stats (read/write):
  nvme0n1: ios=261399/0, merge=0/0, ticks=16019/0, in_queue=20910, util=100.00%

------------------------------------------------------
OR, here's a better run WITH the patch applied, and you can see that this is nearly as good
as the "without" case:
------------------------------------------------------

reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=53.2MiB/s,w=0KiB/s][r=13.6k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=2521: Tue Nov  6 20:01:33 2018
   read: IOPS=13.4k, BW=52.5MiB/s (55.1MB/s)(1024MiB/19499msec)
    slat (usec): min=30, max=12458, avg=69.71, stdev=88.01
    clat (usec): min=39, max=25590, avg=4687.42, stdev=1925.29
     lat (usec): min=97, max=25704, avg=4757.25, stdev=1946.06
    clat percentiles (usec):
     |  1.00th=[ 3884],  5.00th=[ 3884], 10.00th=[ 3916], 20.00th=[ 3916],
     | 30.00th=[ 3916], 40.00th=[ 3916], 50.00th=[ 3949], 60.00th=[ 3949],
     | 70.00th=[ 3982], 80.00th=[ 4555], 90.00th=[ 6915], 95.00th=[ 8848],
     | 99.00th=[13566], 99.50th=[14877], 99.90th=[16909], 99.95th=[17695],
     | 99.99th=[24249]
   bw (  KiB/s): min=48905, max=58016, per=100.00%, avg=53855.79, stdev=2115.03, samples=38
   iops        : min=12226, max=14504, avg=13463.79, stdev=528.76, samples=38
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=71.80%, 10=24.66%, 20=3.51%, 50=0.02%
  cpu          : usr=3.47%, sys=94.61%, ctx=370, majf=0, minf=73
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=52.5MiB/s (55.1MB/s), 52.5MiB/s-52.5MiB/s (55.1MB/s-55.1MB/s), io=1024MiB (1074MB), run=19499-19499msec

Disk stats (read/write):
  nvme0n1: ios=260720/0, merge=0/0, ticks=15036/0, in_queue=19876, util=100.00%


[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] fio: https://github.com/axboe/fio

John Hubbard (6):
  mm/gup: finish consolidating error handling
  mm: introduce put_user_page*(), placeholder versions
  infiniband/mm: convert put_page() to put_user_page*()
  mm: introduce page->dma_pinned_flags, _count
  mm: introduce zone_gup_lock, for dma-pinned pages
  mm: track gup pages with page->dma_pinned_* fields

 drivers/infiniband/core/umem.c              |   7 +-
 drivers/infiniband/core/umem_odp.c          |   2 +-
 drivers/infiniband/hw/hfi1/user_pages.c     |  11 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |   6 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  |  11 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |   6 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c    |   7 +-
 include/linux/mm.h                          |  13 ++
 include/linux/mm_types.h                    |  22 +++-
 include/linux/mmzone.h                      |   6 +
 include/linux/page-flags.h                  |  61 +++++++++
 mm/gup.c                                    |  58 +++++++-
 mm/memcontrol.c                             |   8 ++
 mm/page_alloc.c                             |   1 +
 mm/swap.c                                   | 138 ++++++++++++++++++++
 15 files changed, 320 insertions(+), 37 deletions(-)

Comments

Tom Talpey Nov. 19, 2018, 6:57 p.m. UTC | #1
John, thanks for the discussion at LPC. One of the concerns we
raised however was the performance test. The numbers below are
rather obviously tainted. I think we need to get a better baseline
before concluding anything...

Here's my main concern:

On 11/10/2018 3:50 AM, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
>...
> ------------------------------------------------------
> WITHOUT the patch:
> ------------------------------------------------------
> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
> fio-3.3
> Starting 1 process
> Jobs: 1 (f=1): [R(1)][100.0%][r=55.5MiB/s,w=0KiB/s][r=14.2k,w=0 IOPS][eta 00m:00s]
> reader: (groupid=0, jobs=1): err= 0: pid=1750: Tue Nov  6 20:18:06 2018
>     read: IOPS=13.9k, BW=54.4MiB/s (57.0MB/s)(1024MiB/18826msec)

~14000 4KB read IOPS is really, really low for an NVMe disk.

>    cpu          : usr=2.39%, sys=95.30%, ctx=669, majf=0, minf=72

CPU is obviously the limiting factor. At these IOPS, it should be far
less.
> ------------------------------------------------------
> OR, here's a better run WITH the patch applied, and you can see that this is nearly as good
> as the "without" case:
> ------------------------------------------------------
> 
> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
> fio-3.3
> Starting 1 process
> Jobs: 1 (f=1): [R(1)][100.0%][r=53.2MiB/s,w=0KiB/s][r=13.6k,w=0 IOPS][eta 00m:00s]
> reader: (groupid=0, jobs=1): err= 0: pid=2521: Tue Nov  6 20:01:33 2018
>     read: IOPS=13.4k, BW=52.5MiB/s (55.1MB/s)(1024MiB/19499msec)

Similar low IOPS.

>    cpu          : usr=3.47%, sys=94.61%, ctx=370, majf=0, minf=73

Similar CPU saturation.

>

I get nearly 400,000 4KB IOPS on my tiny desktop, which has a 25W
i7-7500 and a Samsung PM961 128GB NVMe (stock Bionic 4.15 kernel
and fio version 3.1). Even then, the CPU saturates, so it's not
necessarily a perfect test. I'd like to see your runs both get to
"max" IOPS, i.e. CPU < 100%, and compare the CPU numbers. This would
give the best comparison for making a decision.

Can you confirm what type of hardware you're running this test on?
CPU, memory speed and capacity, and NVMe device especially?

Tom.
John Hubbard Nov. 21, 2018, 6:09 a.m. UTC | #2
On 11/19/18 10:57 AM, Tom Talpey wrote:
> John, thanks for the discussion at LPC. One of the concerns we
> raised however was the performance test. The numbers below are
> rather obviously tainted. I think we need to get a better baseline
> before concluding anything...
> 
> Here's my main concern:
> 

Hi Tom,

Thanks again for looking at this!


> On 11/10/2018 3:50 AM, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>> ...
>> ------------------------------------------------------
>> WITHOUT the patch:
>> ------------------------------------------------------
>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
>> fio-3.3
>> Starting 1 process
>> Jobs: 1 (f=1): [R(1)][100.0%][r=55.5MiB/s,w=0KiB/s][r=14.2k,w=0 IOPS][eta 00m:00s]
>> reader: (groupid=0, jobs=1): err= 0: pid=1750: Tue Nov  6 20:18:06 2018
>>     read: IOPS=13.9k, BW=54.4MiB/s (57.0MB/s)(1024MiB/18826msec)
> 
> ~14000 4KB read IOPS is really, really low for an NVMe disk.

Yes, but Jan Kara's original config file for fio is *intended* to highlight
the get_user_pages/put_user_pages changes. It was *not* intended to get max
performance,  as you can see by the numjobs and direct IO parameters:

cat fio.conf 
[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64


So I'm thinking that this is not a "tainted" test, but rather, we're constraining
things a lot with these choices. It's hard to find a good test config to run that
allows decisions, but so far, I'm not really seeing anything that says "this
is so bad that we can't afford to fix the brokenness." I think.

After talking with you and reading this email, I did a bunch more test runs, 
varying the following fio parameters:

	-- direct
	-- numjobs
	-- iodepth

...with both the baseline 4.20-rc3 kernel, and with my patches applied. (btw, if
anyone cares, I'll post a github link that has a complete, testable patchset--not
ready for submission as such, but it works cleanly and will allow others to 
attempt to reproduce my results).

What I'm seeing is that I can get 10x or better improvements in IOPS and BW,
just by going to 10 threads and turning off direct IO--as expected. So in the end,
I increased the number of threads, and also increased iodepth a bit. 


Test results below...


> 
>>    cpu          : usr=2.39%, sys=95.30%, ctx=669, majf=0, minf=72
> 
> CPU is obviously the limiting factor. At these IOPS, it should be far
> less.
>> ------------------------------------------------------
>> OR, here's a better run WITH the patch applied, and you can see that this is nearly as good
>> as the "without" case:
>> ------------------------------------------------------
>>
>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
>> fio-3.3
>> Starting 1 process
>> Jobs: 1 (f=1): [R(1)][100.0%][r=53.2MiB/s,w=0KiB/s][r=13.6k,w=0 IOPS][eta 00m:00s]
>> reader: (groupid=0, jobs=1): err= 0: pid=2521: Tue Nov  6 20:01:33 2018
>>     read: IOPS=13.4k, BW=52.5MiB/s (55.1MB/s)(1024MiB/19499msec)
> 
> Similar low IOPS.
> 
>>    cpu          : usr=3.47%, sys=94.61%, ctx=370, majf=0, minf=73
> 
> Similar CPU saturation.
> 
>>
> 
> I get nearly 400,000 4KB IOPS on my tiny desktop, which has a 25W
> i7-7500 and a Samsung PM961 128GB NVMe (stock Bionic 4.15 kernel
> and fio version 3.1). Even then, the CPU saturates, so it's not
> necessarily a perfect test. I'd like to see your runs both get to
> "max" IOPS, i.e. CPU < 100%, and compare the CPU numbers. This would
> give the best comparison for making a decision.

I can get to CPU < 100% by increasing to 10 or 20 threads, although it
makes latency ever so much worse.

> 
> Can you confirm what type of hardware you're running this test on?
> CPU, memory speed and capacity, and NVMe device especially?
> 
> Tom.

Yes, it's a nice new system, I don't expect any strange perf problems:

CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
    (Intel X299 chipset)
Block device: nvme-Samsung_SSD_970_EVO_250GB
DRAM: 32 GB

So, here's a comparison using 20 threads, direct IO, for the baseline vs. 
patched kernel (below). Highlights:

	-- IOPS are similar, around 60k. 
	-- BW gets worse, dropping from 290 to 220 MB/s.
	-- CPU is well under 100%.
	-- latency is incredibly long, but...20 threads.

Baseline:

$ ./run.sh
fio configuration:
[reader]
ioengine=libaio
blocksize=4096
size=1g
rw=read
group_reporting
iodepth=256
direct=1
numjobs=20
-------- Running fio:
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.3
Starting 20 processes
Jobs: 4 (f=4): [_(8),R(2),_(2),R(1),_(1),R(1),_(5)][95.9%][r=244MiB/s,w=0KiB/s][r=62.5k,w=0 IOPS][eta 00m:03s]
reader: (groupid=0, jobs=20): err= 0: pid=14499: Tue Nov 20 16:20:35 2018
   read: IOPS=74.2k, BW=290MiB/s (304MB/s)(20.0GiB/70644msec)
    slat (usec): min=26, max=48167, avg=249.27, stdev=1200.02
    clat (usec): min=42, max=147792, avg=67108.56, stdev=18062.46
     lat (usec): min=103, max=147943, avg=67358.10, stdev=18109.75
    clat percentiles (msec):
     |  1.00th=[   21],  5.00th=[   40], 10.00th=[   41], 20.00th=[   47],
     | 30.00th=[   58], 40.00th=[   65], 50.00th=[   70], 60.00th=[   75],
     | 70.00th=[   79], 80.00th=[   83], 90.00th=[   89], 95.00th=[   93],
     | 99.00th=[  104], 99.50th=[  109], 99.90th=[  121], 99.95th=[  125],
     | 99.99th=[  134]
   bw (  KiB/s): min= 9712, max=46362, per=5.11%, avg=15164.99, stdev=2242.15, samples=2742
   iops        : min= 2428, max=11590, avg=3790.94, stdev=560.53, samples=2742
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.98%, 50=20.44%
  lat (msec)   : 100=76.95%, 250=1.61%
  cpu          : usr=1.00%, sys=57.65%, ctx=158367, majf=0, minf=5284
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=5242880,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=290MiB/s (304MB/s), 290MiB/s-290MiB/s (304MB/s-304MB/s), io=20.0GiB (21.5GB), run=70644-70644msec

Disk stats (read/write):
  nvme0n1: ios=5240738/7, merge=0/7, ticks=1457727/5, in_queue=1547139, util=100.00%

--------------------------------------------------------------
Patched:

<redforge> fast_256GB $ ./run.sh 
fio configuration:
[reader]
ioengine=libaio
blocksize=4096
size=1g
rw=read
group_reporting
iodepth=256
direct=1
numjobs=20
-------- Running fio:
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.3
Starting 20 processes
Jobs: 13 (f=8): [_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0 IOPS][eta 00m:02s]
reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018
   read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec)
    slat (usec): min=26, max=50436, avg=337.21, stdev=1405.14
    clat (usec): min=43, max=178839, avg=88963.96, stdev=21745.31
     lat (usec): min=106, max=179041, avg=89301.43, stdev=21800.43
    clat percentiles (msec):
     |  1.00th=[   50],  5.00th=[   53], 10.00th=[   55], 20.00th=[   68],
     | 30.00th=[   79], 40.00th=[   86], 50.00th=[   93], 60.00th=[   99],
     | 70.00th=[  103], 80.00th=[  108], 90.00th=[  114], 95.00th=[  121],
     | 99.00th=[  134], 99.50th=[  140], 99.90th=[  150], 99.95th=[  155],
     | 99.99th=[  163]
   bw (  KiB/s): min= 4920, max=39733, per=5.07%, avg=11506.18, stdev=1540.18, samples=3650
   iops        : min= 1230, max= 9933, avg=2876.20, stdev=385.05, samples=3650
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.23%, 50=1.13%
  lat (msec)   : 100=63.04%, 250=35.57%
  cpu          : usr=0.65%, sys=58.07%, ctx=188963, majf=0, minf=5303
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=5242880,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=222MiB/s (232MB/s), 222MiB/s-222MiB/s (232MB/s-232MB/s), io=20.0GiB (21.5GB), run=92385-92385msec

Disk stats (read/write):
  nvme0n1: ios=5240550/7, merge=0/7, ticks=1513681/4, in_queue=1636411, util=100.00%


Thoughts?


thanks,
Tom Talpey Nov. 21, 2018, 4:49 p.m. UTC | #3
On 11/21/2018 1:09 AM, John Hubbard wrote:
> On 11/19/18 10:57 AM, Tom Talpey wrote:
>> ~14000 4KB read IOPS is really, really low for an NVMe disk.
> 
> Yes, but Jan Kara's original config file for fio is *intended* to highlight
> the get_user_pages/put_user_pages changes. It was *not* intended to get max
> performance,  as you can see by the numjobs and direct IO parameters:
> 
> cat fio.conf
> [reader]
> direct=1
> ioengine=libaio
> blocksize=4096
> size=1g
> numjobs=1
> rw=read
> iodepth=64

To be clear - I used those identical parameters, on my lower-spec
machine, and got 400,000 4KB read IOPS. Those results are nearly 30x
higher than yours!

> So I'm thinking that this is not a "tainted" test, but rather, we're constraining
> things a lot with these choices. It's hard to find a good test config to run that
> allows decisions, but so far, I'm not really seeing anything that says "this
> is so bad that we can't afford to fix the brokenness." I think.

I'm not suggesting we tune the benchmark, I'm suggesting the results
on your system are not meaningful since they are orders of magnitude
low. And without meaningful data it's impossible to see the performance
impact of the change...

>> Can you confirm what type of hardware you're running this test on?
>> CPU, memory speed and capacity, and NVMe device especially?
>>
>> Tom.
> 
> Yes, it's a nice new system, I don't expect any strange perf problems:
> 
> CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
>      (Intel X299 chipset)
> Block device: nvme-Samsung_SSD_970_EVO_250GB
> DRAM: 32 GB

The Samsung Evo 970 250GB is speced to yield 200,000 random read IOPS
with a 4KB QD32 workload:

 
https://www.samsung.com/us/computing/memory-storage/solid-state-drives/ssd-970-evo-nvme-m-2-250gb-mz-v7e250bw/#specs

And the I7-7800X is a 6-core processor (12 hyperthreads).

> So, here's a comparison using 20 threads, direct IO, for the baseline vs.
> patched kernel (below). Highlights:
> 
> 	-- IOPS are similar, around 60k.
> 	-- BW gets worse, dropping from 290 to 220 MB/s.
> 	-- CPU is well under 100%.
> 	-- latency is incredibly long, but...20 threads.
> 
> Baseline:
> 
> $ ./run.sh
> fio configuration:
> [reader]
> ioengine=libaio
> blocksize=4096
> size=1g
> rw=read
> group_reporting
> iodepth=256
> direct=1
> numjobs=20

Ouch - 20 threads issuing 256 io's each!? Of course latency skyrockets.
That's going to cause tremendous queuing, and context switching, far
outside of the get_user_pages() change.
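(With numjobs=20 and iodepth=256, that is 20 x 256 = 5,120 I/Os outstanding
at once, against a device whose 4KB random read rating above is quoted at
QD32.)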

But even so, it only brings IOPS to 74.2K, which is still far short of
the device's 200K spec.

Comparing anyway:


> Patched:
> 
> -------- Running fio:
> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
> ...
> fio-3.3
> Starting 20 processes
> Jobs: 13 (f=8): [_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0 IOPS][eta 00m:02s]
> reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018
>     read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec)
> ...
> Thoughts?

Concern - the 74.2K IOPS unpatched drops to 56.8K patched!

What I'd really like to see is to go back to the original fio parameters
(1 thread, 64 iodepth) and try to get a result that gets at least close
to the speced 200K IOPS of the NVMe device. There seems to be something
wrong with yours, currently.

Then of course, the result with the patched get_user_pages, and
compare whichever of IOPS or CPU% changes, and how much.

If these are within a few percent, I agree it's good to go. If it's
roughly 25% like the result just above, that's a rocky road.

I can try this after the holiday on some basic hardware and might
be able to scrounge up better. Can you post that github link?

Tom.
John Hubbard Nov. 21, 2018, 10:06 p.m. UTC | #4
On 11/21/18 8:49 AM, Tom Talpey wrote:
> On 11/21/2018 1:09 AM, John Hubbard wrote:
>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>> ~14000 4KB read IOPS is really, really low for an NVMe disk.
>>
>> Yes, but Jan Kara's original config file for fio is *intended* to highlight
>> the get_user_pages/put_user_pages changes. It was *not* intended to get max
>> performance,  as you can see by the numjobs and direct IO parameters:
>>
>> cat fio.conf
>> [reader]
>> direct=1
>> ioengine=libaio
>> blocksize=4096
>> size=1g
>> numjobs=1
>> rw=read
>> iodepth=64
> 
> To be clear - I used those identical parameters, on my lower-spec
> machine, and got 400,000 4KB read IOPS. Those results are nearly 30x
> higher than yours!

OK, then something really is wrong here...

> 
>> So I'm thinking that this is not a "tainted" test, but rather, we're constraining
>> things a lot with these choices. It's hard to find a good test config to run that
>> allows decisions, but so far, I'm not really seeing anything that says "this
>> is so bad that we can't afford to fix the brokenness." I think.
> 
> I'm not suggesting we tune the benchmark, I'm suggesting the results
> on your system are not meaningful since they are orders of magnitude
> low. And without meaningful data it's impossible to see the performance
> impact of the change...
> 
>>> Can you confirm what type of hardware you're running this test on?
>>> CPU, memory speed and capacity, and NVMe device especially?
>>>
>>> Tom.
>>
>> Yes, it's a nice new system, I don't expect any strange perf problems:
>>
>> CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
>>      (Intel X299 chipset)
>> Block device: nvme-Samsung_SSD_970_EVO_250GB
>> DRAM: 32 GB
> 
> The Samsung Evo 970 250GB is speced to yield 200,000 random read IOPS
> with a 4KB QD32 workload:
> 
> 
> https://www.samsung.com/us/computing/memory-storage/solid-state-drives/ssd-970-evo-nvme-m-2-250gb-mz-v7e250bw/#specs
> 
> And the I7-7800X is a 6-core processor (12 hyperthreads).
> 
>> So, here's a comparison using 20 threads, direct IO, for the baseline vs.
>> patched kernel (below). Highlights:
>>
>>     -- IOPS are similar, around 60k.
>>     -- BW gets worse, dropping from 290 to 220 MB/s.
>>     -- CPU is well under 100%.
>>     -- latency is incredibly long, but...20 threads.
>>
>> Baseline:
>>
>> $ ./run.sh
>> fio configuration:
>> [reader]
>> ioengine=libaio
>> blocksize=4096
>> size=1g
>> rw=read
>> group_reporting
>> iodepth=256
>> direct=1
>> numjobs=20
> 
> Ouch - 20 threads issuing 256 io's each!? Of course latency skyrockets.
> That's going to cause tremendous queuing, and context switching, far
> outside of the get_user_pages() change.
> 
> But even so, it only brings IOPS to 74.2K, which is still far short of
> the device's 200K spec.
> 
> Comparing anyway:
> 
> 
>> Patched:
>>
>> -------- Running fio:
>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
>> ...
>> fio-3.3
>> Starting 20 processes
>> Jobs: 13 (f=8): [_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0 IOPS][eta 00m:02s]
>> reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018
>>     read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec)
>> ...
>> Thoughts?
> 
> Concern - the 74.2K IOPS unpatched drops to 56.8K patched!

ACK. :)

> 
> What I'd really like to see is to go back to the original fio parameters
> (1 thread, 64 iodepth) and try to get a result that gets at least close
> to the speced 200K IOPS of the NVMe device. There seems to be something
> wrong with yours, currently.

I'll dig into what has gone wrong with the test. I see fio putting data files
in the right place, so the obvious "using the wrong drive" is (probably)
not it. Even though it really feels like that sort of thing. We'll see. 

> 
> Then of course, the result with the patched get_user_pages, and
> compare whichever of IOPS or CPU% changes, and how much.
> 
> If these are within a few percent, I agree it's good to go. If it's
> roughly 25% like the result just above, that's a rocky road.
> 
> I can try this after the holiday on some basic hardware and might
> be able to scrounge up better. Can you post that github link?
> 

Here:

   git@github.com:johnhubbard/linux (branch: gup_dma_testing)
Tom Talpey Nov. 28, 2018, 1:21 a.m. UTC | #5
On 11/21/2018 5:06 PM, John Hubbard wrote:
> On 11/21/18 8:49 AM, Tom Talpey wrote:
>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>> ~14000 4KB read IOPS is really, really low for an NVMe disk.
>>>
>>> Yes, but Jan Kara's original config file for fio is *intended* to highlight
>>> the get_user_pages/put_user_pages changes. It was *not* intended to get max
>>> performance,  as you can see by the numjobs and direct IO parameters:
>>>
>>> cat fio.conf
>>> [reader]
>>> direct=1
>>> ioengine=libaio
>>> blocksize=4096
>>> size=1g
>>> numjobs=1
>>> rw=read
>>> iodepth=64
>>
>> To be clear - I used those identical parameters, on my lower-spec
>> machine, and got 400,000 4KB read IOPS. Those results are nearly 30x
>> higher than yours!
> 
> OK, then something really is wrong here...
> 
>>
>>> So I'm thinking that this is not a "tainted" test, but rather, we're constraining
>>> things a lot with these choices. It's hard to find a good test config to run that
>>> allows decisions, but so far, I'm not really seeing anything that says "this
>>> is so bad that we can't afford to fix the brokenness." I think.
>>
>> I'm not suggesting we tune the benchmark, I'm suggesting the results
>> on your system are not meaningful since they are orders of magnitude
>> low. And without meaningful data it's impossible to see the performance
>> impact of the change...
>>
>>>> Can you confirm what type of hardware you're running this test on?
>>>> CPU, memory speed and capacity, and NVMe device especially?
>>>>
>>>> Tom.
>>>
>>> Yes, it's a nice new system, I don't expect any strange perf problems:
>>>
>>> CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
>>>       (Intel X299 chipset)
>>> Block device: nvme-Samsung_SSD_970_EVO_250GB
>>> DRAM: 32 GB
>>
>> The Samsung Evo 970 250GB is speced to yield 200,000 random read IOPS
>> with a 4KB QD32 workload:
>>
>>
>> https://www.samsung.com/us/computing/memory-storage/solid-state-drives/ssd-970-evo-nvme-m-2-250gb-mz-v7e250bw/#specs
>>
>> And the I7-7800X is a 6-core processor (12 hyperthreads).
>>
>>> So, here's a comparison using 20 threads, direct IO, for the baseline vs.
>>> patched kernel (below). Highlights:
>>>
>>>      -- IOPS are similar, around 60k.
>>>      -- BW gets worse, dropping from 290 to 220 MB/s.
>>>      -- CPU is well under 100%.
>>>      -- latency is incredibly long, but...20 threads.
>>>
>>> Baseline:
>>>
>>> $ ./run.sh
>>> fio configuration:
>>> [reader]
>>> ioengine=libaio
>>> blocksize=4096
>>> size=1g
>>> rw=read
>>> group_reporting
>>> iodepth=256
>>> direct=1
>>> numjobs=20
>>
>> Ouch - 20 threads issuing 256 io's each!? Of course latency skyrockets.
>> That's going to cause tremendous queuing, and context switching, far
>> outside of the get_user_pages() change.
>>
>> But even so, it only brings IOPS to 74.2K, which is still far short of
>> the device's 200K spec.
>>
>> Comparing anyway:
>>
>>
>>> Patched:
>>>
>>> -------- Running fio:
>>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
>>> ...
>>> fio-3.3
>>> Starting 20 processes
>>> Jobs: 13 (f=8): [_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0 IOPS][eta 00m:02s]
>>> reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018
>>>      read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec)
>>> ...
>>> Thoughts?
>>
>> Concern - the 74.2K IOPS unpatched drops to 56.8K patched!
> 
> ACK. :)
> 
>>
>> What I'd really like to see is to go back to the original fio parameters
>> (1 thread, 64 iodepth) and try to get a result that gets at least close
>> to the speced 200K IOPS of the NVMe device. There seems to be something
>> wrong with yours, currently.
> 
> I'll dig into what has gone wrong with the test. I see fio putting data files
> in the right place, so the obvious "using the wrong drive" is (probably)
> not it. Even though it really feels like that sort of thing. We'll see.
> 
>>
>> Then of course, the result with the patched get_user_pages, and
>> compare whichever of IOPS or CPU% changes, and how much.
>>
>> If these are within a few percent, I agree it's good to go. If it's
>> roughly 25% like the result just above, that's a rocky road.
>>
>> I can try this after the holiday on some basic hardware and might
>> be able to scrounge up better. Can you post that github link?
>>
> 
> Here:
> 
>     git@github.com:johnhubbard/linux (branch: gup_dma_testing)

I'm super-limited here this week hardware-wise and have not been able
to try testing with the patched kernel.

I was able to compare my earlier quick test with a Bionic 4.15 kernel
(400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
test, and without your change.

Say, that branch reports it has not had a commit since June 30. Is that
the right one? What about gup_dma_for_lpc_2018?

Tom.
John Hubbard Nov. 28, 2018, 2:52 a.m. UTC | #6
On 11/27/18 5:21 PM, Tom Talpey wrote:
> On 11/21/2018 5:06 PM, John Hubbard wrote:
>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
[...]
>>>
>>> What I'd really like to see is to go back to the original fio parameters
>>> (1 thread, 64 iodepth) and try to get a result that gets at least close
>>> to the speced 200K IOPS of the NVMe device. There seems to be something
>>> wrong with yours, currently.
>>
>> I'll dig into what has gone wrong with the test. I see fio putting data files
>> in the right place, so the obvious "using the wrong drive" is (probably)
>> not it. Even though it really feels like that sort of thing. We'll see.
>>
>>>
>>> Then of course, the result with the patched get_user_pages, and
>>> compare whichever of IOPS or CPU% changes, and how much.
>>>
>>> If these are within a few percent, I agree it's good to go. If it's
>>> roughly 25% like the result just above, that's a rocky road.
>>>
>>> I can try this after the holiday on some basic hardware and might
>>> be able to scrounge up better. Can you post that github link?
>>>
>>
>> Here:
>>
>>     git@github.com:johnhubbard/linux (branch: gup_dma_testing)
> 
> I'm super-limited here this week hardware-wise and have not been able
> to try testing with the patched kernel.
> 
> I was able to compare my earlier quick test with a Bionic 4.15 kernel
> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
> test, and without your change.
> 

So just to double check (again): you are running fio with these parameters,
right?

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64



> Say, that branch reports it has not had a commit since June 30. Is that
> the right one? What about gup_dma_for_lpc_2018?
> 

That's the right branch, but the AuthorDate for the head commit (only) somehow
got stuck in the past. I just now amended that patch with a new date and pushed 
it, so the head commit now shows Nov 27:

   https://github.com/johnhubbard/linux/commits/gup_dma_testing


The actual code is the same, though. (It is still based on Nov 19th's f2ce1065e767
commit.)


thanks,
Tom Talpey Nov. 28, 2018, 1:59 p.m. UTC | #7
On 11/27/2018 9:52 PM, John Hubbard wrote:
> On 11/27/18 5:21 PM, Tom Talpey wrote:
>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
> [...]
>>>>
>>>> What I'd really like to see is to go back to the original fio parameters
>>>> (1 thread, 64 iodepth) and try to get a result that gets at least close
>>>> to the speced 200K IOPS of the NVMe device. There seems to be something
>>>> wrong with yours, currently.
>>>
>>> I'll dig into what has gone wrong with the test. I see fio putting data files
>>> in the right place, so the obvious "using the wrong drive" is (probably)
>>> not it. Even though it really feels like that sort of thing. We'll see.
>>>
>>>>
>>>> Then of course, the result with the patched get_user_pages, and
>>>> compare whichever of IOPS or CPU% changes, and how much.
>>>>
>>>> If these are within a few percent, I agree it's good to go. If it's
>>>> roughly 25% like the result just above, that's a rocky road.
>>>>
>>>> I can try this after the holiday on some basic hardware and might
>>>> be able to scrounge up better. Can you post that github link?
>>>>
>>>
>>> Here:
>>>
>>>      git@github.com:johnhubbard/linux (branch: gup_dma_testing)
>>
>> I'm super-limited here this week hardware-wise and have not been able
>> to try testing with the patched kernel.
>>
>> I was able to compare my earlier quick test with a Bionic 4.15 kernel
>> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
>> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
>> test, and without your change.
>>
> 
> So just to double check (again): you are running fio with these parameters,
> right?
> 
> [reader]
> direct=1
> ioengine=libaio
> blocksize=4096
> size=1g
> numjobs=1
> rw=read
> iodepth=64

Correct, I copy/pasted these directly. I also ran with size=10g because
the 1g provides a really small sample set.

There was one other difference, your results indicated fio 3.3 was used.
My Bionic install has fio 3.1. I don't find that relevant because our
goal is to compare before/after, which I haven't done yet.

Tom.

> 
> 
> 
>> Say, that branch reports it has not had a commit since June 30. Is that
>> the right one? What about gup_dma_for_lpc_2018?
>>
> 
> That's the right branch, but the AuthorDate for the head commit (only) somehow
> got stuck in the past. I just now amended that patch with a new date and pushed
> it, so the head commit now shows Nov 27:
> 
>     https://github.com/johnhubbard/linux/commits/gup_dma_testing
> 
> 
> The actual code is the same, though. (It is still based on Nov 19th's f2ce1065e767
> commit.)
> 
> 
> thanks,
>
John Hubbard Nov. 30, 2018, 1:39 a.m. UTC | #8
On 11/28/18 5:59 AM, Tom Talpey wrote:
> On 11/27/2018 9:52 PM, John Hubbard wrote:
>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>> [...]
>>> I'm super-limited here this week hardware-wise and have not been able
>>> to try testing with the patched kernel.
>>>
>>> I was able to compare my earlier quick test with a Bionic 4.15 kernel
>>> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
>>> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
>>> test, and without your change.
>>>
>>
>> So just to double check (again): you are running fio with these parameters,
>> right?
>>
>> [reader]
>> direct=1
>> ioengine=libaio
>> blocksize=4096
>> size=1g
>> numjobs=1
>> rw=read
>> iodepth=64
> 
> Correct, I copy/pasted these directly. I also ran with size=10g because
> the 1g provides a really small sample set.
> 
> There was one other difference, your results indicated fio 3.3 was used.
> My Bionic install has fio 3.1. I don't find that relevant because our
> goal is to compare before/after, which I haven't done yet.
> 

OK, the 50 MB/s was due to my particular .config. I had some expensive debug options
set in mm, fs and locking subsystems. Turning those off, I'm back up to the rated
speed of the Samsung NVMe device, so now we should have a clearer picture of the
performance that real users will see.

Continuing on, then: running a before and after test, I don't see any significant 
difference in the fio results:

fio.conf:

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64

---------------------------------------------------------
Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:

$ fio ./experimental-fio.conf 
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018
   read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
    slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46
    clat (usec): min=162, max=12247, avg=330.00, stdev=185.55
     lat (usec): min=165, max=12253, avg=331.68, stdev=185.69
    clat percentiles (usec):
     |  1.00th=[  322],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
     | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
     | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
     | 99.00th=[  379], 99.50th=[  594], 99.90th=[  603], 99.95th=[  611],
     | 99.99th=[12125]
   bw (  KiB/s): min=751640, max=782912, per=99.52%, avg=767276.00, stdev=22112.64, samples=2
   iops        : min=187910, max=195728, avg=191819.00, stdev=5528.16, samples=2
  lat (usec)   : 250=0.08%, 500=99.30%, 750=0.59%
  lat (msec)   : 20=0.02%
  cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), io=1024MiB (1074MB), run=1360-1360msec

Disk stats (read/write):
  nvme0n1: ios=220798/0, merge=0/0, ticks=71481/0, in_queue=71966, util=100.00%

---------------------------------------------------------
With patches applied:

<redforge> fast_256GB $ fio ./experimental-fio.conf
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018
   read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
    slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46
    clat (usec): min=162, max=12247, avg=330.00, stdev=185.55
     lat (usec): min=165, max=12253, avg=331.68, stdev=185.69
    clat percentiles (usec):
     |  1.00th=[  322],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
     | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
     | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
     | 99.00th=[  379], 99.50th=[  594], 99.90th=[  603], 99.95th=[  611],
     | 99.99th=[12125]
   bw (  KiB/s): min=751640, max=782912, per=99.52%, avg=767276.00, stdev=22112.64, samples=2
   iops        : min=187910, max=195728, avg=191819.00, stdev=5528.16, samples=2
  lat (usec)   : 250=0.08%, 500=99.30%, 750=0.59%
  lat (msec)   : 20=0.02%
  cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), io=1024MiB (1074MB), run=1360-1360msec

Disk stats (read/write):
  nvme0n1: ios=220798/0, merge=0/0, ticks=71481/0, in_queue=71966, util=100.00%


thanks,
Tom Talpey Nov. 30, 2018, 2:18 a.m. UTC | #9
On 11/29/2018 8:39 PM, John Hubbard wrote:
> On 11/28/18 5:59 AM, Tom Talpey wrote:
>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>> [...]
>>>> I'm super-limited here this week hardware-wise and have not been able
>>>> to try testing with the patched kernel.
>>>>
>>>> I was able to compare my earlier quick test with a Bionic 4.15 kernel
>>>> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
>>>> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
>>>> test, and without your change.
>>>>
>>>
>>> So just to double check (again): you are running fio with these parameters,
>>> right?
>>>
>>> [reader]
>>> direct=1
>>> ioengine=libaio
>>> blocksize=4096
>>> size=1g
>>> numjobs=1
>>> rw=read
>>> iodepth=64
>>
>> Correct, I copy/pasted these directly. I also ran with size=10g because
>> the 1g provides a really small sample set.
>>
>> There was one other difference, your results indicated fio 3.3 was used.
>> My Bionic install has fio 3.1. I don't find that relevant because our
>> goal is to compare before/after, which I haven't done yet.
>>
> 
> OK, the 50 MB/s was due to my particular .config. I had some expensive debug options
> set in mm, fs and locking subsystems. Turning those off, I'm back up to the rated
> speed of the Samsung NVMe device, so now we should have a clearer picture of the
> performance that real users will see.

Oh, good! I'm especially glad because I was having a heck of a time
reconfiguring the one machine I have available for this.

> Continuing on, then: running a before and after test, I don't see any significant
> difference in the fio results:

Excerpting from below:

 > Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
 >     read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
 >    cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73

vs

 > With patches applied:
 >     read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
 >    cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73

Perfect results, not CPU limited, and full IOPS.

Curiously identical, so I trust you've checked that you measured
both targets, but if so, I say it's good.

Tom.

> 
> fio.conf:
> 
> [reader]
> direct=1
> ioengine=libaio
> blocksize=4096
> size=1g
> numjobs=1
> rw=read
> iodepth=64
> 
> ---------------------------------------------------------
> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
> 
> $ fio ./experimental-fio.conf
> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
> fio-3.3
> Starting 1 process
> Jobs: 1 (f=1)
> reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018
>     read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>      slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46
>      clat (usec): min=162, max=12247, avg=330.00, stdev=185.55
>       lat (usec): min=165, max=12253, avg=331.68, stdev=185.69
>      clat percentiles (usec):
>       |  1.00th=[  322],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
>       | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
>       | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
>       | 99.00th=[  379], 99.50th=[  594], 99.90th=[  603], 99.95th=[  611],
>       | 99.99th=[12125]
>     bw (  KiB/s): min=751640, max=782912, per=99.52%, avg=767276.00, stdev=22112.64, samples=2
>     iops        : min=187910, max=195728, avg=191819.00, stdev=5528.16, samples=2
>    lat (usec)   : 250=0.08%, 500=99.30%, 750=0.59%
>    lat (msec)   : 20=0.02%
>    cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>    IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
>       issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>       latency   : target=0, window=0, percentile=100.00%, depth=64
> 
> Run status group 0 (all jobs):
>     READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), io=1024MiB (1074MB), run=1360-1360msec
> 
> Disk stats (read/write):
>    nvme0n1: ios=220798/0, merge=0/0, ticks=71481/0, in_queue=71966, util=100.00%
> 
> ---------------------------------------------------------
> With patches applied:
> 
> <redforge> fast_256GB $ fio ./experimental-fio.conf
> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
> fio-3.3
> Starting 1 process
> Jobs: 1 (f=1)
> reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018
>     read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>      slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46
>      clat (usec): min=162, max=12247, avg=330.00, stdev=185.55
>       lat (usec): min=165, max=12253, avg=331.68, stdev=185.69
>      clat percentiles (usec):
>       |  1.00th=[  322],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
>       | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
>       | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
>       | 99.00th=[  379], 99.50th=[  594], 99.90th=[  603], 99.95th=[  611],
>       | 99.99th=[12125]
>     bw (  KiB/s): min=751640, max=782912, per=99.52%, avg=767276.00, stdev=22112.64, samples=2
>     iops        : min=187910, max=195728, avg=191819.00, stdev=5528.16, samples=2
>    lat (usec)   : 250=0.08%, 500=99.30%, 750=0.59%
>    lat (msec)   : 20=0.02%
>    cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>    IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
>       issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>       latency   : target=0, window=0, percentile=100.00%, depth=64
> 
> Run status group 0 (all jobs):
>     READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), io=1024MiB (1074MB), run=1360-1360msec
> 
> Disk stats (read/write):
>    nvme0n1: ios=220798/0, merge=0/0, ticks=71481/0, in_queue=71966, util=100.00%
> 
> 
> thanks,
>
John Hubbard Nov. 30, 2018, 2:21 a.m. UTC | #10
On 11/29/18 6:18 PM, Tom Talpey wrote:
> On 11/29/2018 8:39 PM, John Hubbard wrote:
>> On 11/28/18 5:59 AM, Tom Talpey wrote:
>>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>> [...]
>>>>> I'm super-limited here this week hardware-wise and have not been able
>>>>> to try testing with the patched kernel.
>>>>>
>>>>> I was able to compare my earlier quick test with a Bionic 4.15 kernel
>>>>> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
>>>>> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
>>>>> test, and without your change.
>>>>>
>>>>
>>>> So just to double check (again): you are running fio with these parameters,
>>>> right?
>>>>
>>>> [reader]
>>>> direct=1
>>>> ioengine=libaio
>>>> blocksize=4096
>>>> size=1g
>>>> numjobs=1
>>>> rw=read
>>>> iodepth=64
>>>
>>> Correct, I copy/pasted these directly. I also ran with size=10g because
>>> the 1g provides a really small sample set.
>>>
>>> There was one other difference, your results indicated fio 3.3 was used.
>>> My Bionic install has fio 3.1. I don't find that relevant because our
>>> goal is to compare before/after, which I haven't done yet.
>>>
>>
>> OK, the 50 MB/s was due to my particular .config. I had some expensive debug options
>> set in mm, fs and locking subsystems. Turning those off, I'm back up to the rated
>> speed of the Samsung NVMe device, so now we should have a clearer picture of the
>> performance that real users will see.
> 
> Oh, good! I'm especially glad because I was having a heck of a time
> reconfiguring the one machine I have available for this.
> 
>> Continuing on, then: running a before and after test, I don't see any significant
>> difference in the fio results:
> 
> Excerpting from below:
> 
>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>>     read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>    cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
> 
> vs
> 
>> With patches applied:
>>     read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>    cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
> 
> Perfect results, not CPU limited, and full IOPS.
> 
> Curiously identical, so I trust you've checked that you measured
> both targets, but if so, I say it's good.
> 

Argh, copy-paste error in the email. The real "before" is ever so slightly
better, at 194K IOPS and 759 MB/s:

 $ fio ./experimental-fio.conf
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1715: Thu Nov 29 17:07:09 2018
   read: IOPS=194k, BW=759MiB/s (795MB/s)(1024MiB/1350msec)
    slat (nsec): min=1245, max=2812.7k, avg=1538.03, stdev=5519.61
    clat (usec): min=148, max=755, avg=326.85, stdev=18.13
     lat (usec): min=150, max=3483, avg=328.41, stdev=19.53
    clat percentiles (usec):
     |  1.00th=[  322],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
     | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
     | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
     | 99.00th=[  355], 99.50th=[  537], 99.90th=[  553], 99.95th=[  553],
     | 99.99th=[  619]
   bw (  KiB/s): min=767816, max=783096, per=99.84%, avg=775456.00, stdev=10804.59, samples=2
   iops        : min=191954, max=195774, avg=193864.00, stdev=2701.15, samples=2
  lat (usec)   : 250=0.09%, 500=99.30%, 750=0.61%, 1000=0.01%
  cpu          : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=759MiB/s (795MB/s), 759MiB/s-759MiB/s (795MB/s-795MB/s), io=1024MiB (1074MB), run=1350-1350msec

Disk stats (read/write):
  nvme0n1: ios=222853/0, merge=0/0, ticks=71410/0, in_queue=71935, util=100.00%

thanks,
Tom Talpey Nov. 30, 2018, 2:30 a.m. UTC | #11
On 11/29/2018 9:21 PM, John Hubbard wrote:
> On 11/29/18 6:18 PM, Tom Talpey wrote:
>> On 11/29/2018 8:39 PM, John Hubbard wrote:
>>> On 11/28/18 5:59 AM, Tom Talpey wrote:
>>>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>>> [...]
>>>>>> I'm super-limited here this week hardware-wise and have not been able
>>>>>> to try testing with the patched kernel.
>>>>>>
>>>>>> I was able to compare my earlier quick test with a Bionic 4.15 kernel
>>>>>> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
>>>>>> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
>>>>>> test, and without your change.
>>>>>>
>>>>>
>>>>> So just to double check (again): you are running fio with these parameters,
>>>>> right?
>>>>>
>>>>> [reader]
>>>>> direct=1
>>>>> ioengine=libaio
>>>>> blocksize=4096
>>>>> size=1g
>>>>> numjobs=1
>>>>> rw=read
>>>>> iodepth=64
>>>>
>>>> Correct, I copy/pasted these directly. I also ran with size=10g because
>>>> the 1g provides a really small sample set.
>>>>
>>>> There was one other difference, your results indicated fio 3.3 was used.
>>>> My Bionic install has fio 3.1. I don't find that relevant because our
>>>> goal is to compare before/after, which I haven't done yet.
>>>>
>>>
>>> OK, the 50 MB/s was due to my particular .config. I had some expensive debug options
>>> set in mm, fs and locking subsystems. Turning those off, I'm back up to the rated
>>> speed of the Samsung NVMe device, so now we should have a clearer picture of the
>>> performance that real users will see.
>>
>> Oh, good! I'm especially glad because I was having a heck of a time
>> reconfiguring the one machine I have available for this.
>>
>>> Continuing on, then: running a before and after test, I don't see any significant
>>> difference in the fio results:
>>
>> Excerpting from below:
>>
>>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>>>       read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>      cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>
>> vs
>>
>>> With patches applied:
>>>       read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>      cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>
>> Perfect results, not CPU limited, and full IOPS.
>>
>> Curiously identical, so I trust you've checked that you measured
>> both targets, but if so, I say it's good.
>>
> 
> Argh, copy-paste error in the email. The real "before" is ever so slightly
> better, at 194K IOPS and 759 MB/s:

Definitely better - note the system CPU is lower, which is probably the
reason for the increased IOPS.

>    cpu          : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73

Good result - a correct implementation, and faster.
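For reference, the rough deltas between the two runs, computed only from the
fio output quoted in this thread (no new measurements):

   IOPS:      194k (before)   vs  193k (with patches)   about 0.5% lower
   runtime:   1350 msec       vs  1360 msec             about 0.7% longer
   sys CPU:   44.77%          vs  48.05%                +3.3 points
   usr+sys:   63.01%          vs  64.31%                +1.3 points

So whatever cost the patches add appears mostly as a small amount of extra
system time rather than a meaningful IOPS drop.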

Tom.


> [...]
John Hubbard Nov. 30, 2018, 3 a.m. UTC | #12
On 11/29/18 6:30 PM, Tom Talpey wrote:
> On 11/29/2018 9:21 PM, John Hubbard wrote:
>> On 11/29/18 6:18 PM, Tom Talpey wrote:
>>> On 11/29/2018 8:39 PM, John Hubbard wrote:
>>>> On 11/28/18 5:59 AM, Tom Talpey wrote:
>>>>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>>>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>>>> [...]
>>> Excerpting from below:
>>>
>>>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>>>>       read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>>      cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>>
>>> vs
>>>
>>>> With patches applied:
>>>>       read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>>      cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>>
>>> Perfect results, not CPU limited, and full IOPS.
>>>
>>> Curiously identical, so I trust you've checked that you measured
>>> both targets, but if so, I say it's good.
>>>
>>
>> Argh, copy-paste error in the email. The real "before" is ever so slightly
>> better, at 194K IOPS and 759 MB/s:
> 
> Definitely better - note the system CPU is lower, which is probably the
> reason for the increased IOPS.
> 
>>    cpu          : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73
> 
> Good result - a correct implementation, and faster.
> 

Thanks, Tom, I really appreciate your experience and help on what performance 
should look like here. (I'm sure you can guess that this is the first time 
I've worked with fio, heh.)

I'll send out a new, non-RFC patchset soon, then.

thanks,
Tom Talpey Nov. 30, 2018, 3:14 a.m. UTC | #13
On 11/29/2018 10:00 PM, John Hubbard wrote:
> On 11/29/18 6:30 PM, Tom Talpey wrote:
>> On 11/29/2018 9:21 PM, John Hubbard wrote:
>>> On 11/29/18 6:18 PM, Tom Talpey wrote:
>>>> On 11/29/2018 8:39 PM, John Hubbard wrote:
>>>>> On 11/28/18 5:59 AM, Tom Talpey wrote:
>>>>>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>>>>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>>>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>>>>> [...]
>>>> Excerpting from below:
>>>>
>>>>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>>>>>        read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>>>       cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>>>
>>>> vs
>>>>
>>>>> With patches applied:
>>>>>        read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>>>       cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>>>
>>>> Perfect results, not CPU limited, and full IOPS.
>>>>
>>>> Curiously identical, so I trust you've checked that you measured
>>>> both targets, but if so, I say it's good.
>>>>
>>>
>>> Argh, copy-paste error in the email. The real "before" is ever so slightly
>>> better, at 194K IOPS and 759 MB/s:
>>
>> Definitely better - note the system CPU is lower, which is probably the
>> reason for the increased IOPS.
>>
>>>      cpu          : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73
>>
>> Good result - a correct implementation, and faster.
>>
> 
> Thanks, Tom, I really appreciate your experience and help on what performance
> should look like here. (I'm sure you can guess that this is the first time
> I've worked with fio, heh.)

No problem, happy to chip in. Feel free to add my

Tested-By: Tom Talpey <ttalpey@microsoft.com>

I know, that's not the personal email I'm posting from, but it's me.

I'll hopefully be trying the code with the Linux SMB client (cifs.ko)
next week; Long Li is implementing direct I/O in that, and we'll see how
it helps.

Mainly, I'm looking forward to seeing this enable RDMA-to-DAX.

Tom.

> 
> I'll send out a new, non-RFC patchset soon, then.
> 
> thanks,
>