Message ID | 20181110085041.10071-1-jhubbard@nvidia.com (mailing list archive) |
---|---|
Series | RFC: gup+dma: tracking dma-pinned pages |
John, thanks for the discussion at LPC. One of the concerns we
raised however was the performance test. The numbers below are
rather obviously tainted. I think we need to get a better baseline
before concluding anything...

Here's my main concern:

On 11/10/2018 3:50 AM, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> ...
> ------------------------------------------------------
> WITHOUT the patch:
> ------------------------------------------------------
> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
> fio-3.3
> Starting 1 process
> Jobs: 1 (f=1): [R(1)][100.0%][r=55.5MiB/s,w=0KiB/s][r=14.2k,w=0 IOPS][eta 00m:00s]
> reader: (groupid=0, jobs=1): err= 0: pid=1750: Tue Nov 6 20:18:06 2018
>    read: IOPS=13.9k, BW=54.4MiB/s (57.0MB/s)(1024MiB/18826msec)

~14000 4KB read IOPS is really, really low for an NVMe disk.

>   cpu : usr=2.39%, sys=95.30%, ctx=669, majf=0, minf=72

CPU is obviously the limiting factor. At these IOPS, it should be far
less.

> ------------------------------------------------------
> OR, here's a better run WITH the patch applied, and you can see that this is nearly as good
> as the "without" case:
> ------------------------------------------------------
>
> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
> fio-3.3
> Starting 1 process
> Jobs: 1 (f=1): [R(1)][100.0%][r=53.2MiB/s,w=0KiB/s][r=13.6k,w=0 IOPS][eta 00m:00s]
> reader: (groupid=0, jobs=1): err= 0: pid=2521: Tue Nov 6 20:01:33 2018
>    read: IOPS=13.4k, BW=52.5MiB/s (55.1MB/s)(1024MiB/19499msec)

Similar low IOPS.

>   cpu : usr=3.47%, sys=94.61%, ctx=370, majf=0, minf=73

Similar CPU saturation.

>

I get nearly 400,000 4KB IOPS on my tiny desktop, which has a 25W
i7-7500 and a Samsung PM961 128GB NVMe (stock Bionic 4.15 kernel
and fio version 3.1). Even then, the CPU saturates, so it's not
necessarily a perfect test. I'd like to see your runs both get to
"max" IOPS, i.e. CPU < 100%, and compare the CPU numbers. This would
give the best comparison for making a decision.

Can you confirm what type of hardware you're running this test on?
CPU, memory speed and capacity, and NVMe device especially?

Tom.
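As a sanity check on the figures quoted in this message, the bandwidth fio reports should roughly equal IOPS times the 4096-byte block size. The following is a minimal sketch of that arithmetic, not part of the original thread; the function name is hypothetical and the IOPS inputs are the rounded values from the fio summary lines.

```python
# Cross-check fio's reported bandwidth against IOPS x block size.
BLOCK_SIZE_BYTES = 4096  # blocksize=4096 in the fio job under discussion


def bandwidth_mib_per_s(iops, block_size=BLOCK_SIZE_BYTES):
    """Bandwidth implied by an IOPS figure at a fixed block size, in MiB/s."""
    return iops * block_size / (1024 * 1024)


# IOPS values as rounded by fio in the summary lines quoted in the thread.
without_patch = bandwidth_mib_per_s(13_900)  # fio reported 54.4 MiB/s
with_patch = bandwidth_mib_per_s(13_400)     # fio reported 52.5 MiB/s
desktop = bandwidth_mib_per_s(400_000)       # Tom's "nearly 400,000" IOPS figure

print(f"without patch: {without_patch:.1f} MiB/s")
print(f"with patch:    {with_patch:.1f} MiB/s")
print(f"desktop:       {desktop:.1f} MiB/s")
```

The small gaps against fio's own numbers come from fio rounding the IOPS value to three significant digits.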
On 11/19/18 10:57 AM, Tom Talpey wrote: > John, thanks for the discussion at LPC. One of the concerns we > raised however was the performance test. The numbers below are > rather obviously tainted. I think we need to get a better baseline > before concluding anything... > > Here's my main concern: > Hi Tom, Thanks again for looking at this! > On 11/10/2018 3:50 AM, john.hubbard@gmail.com wrote: >> From: John Hubbard <jhubbard@nvidia.com> >> ... >> ------------------------------------------------------ >> WITHOUT the patch: >> ------------------------------------------------------ >> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64 >> fio-3.3 >> Starting 1 process >> Jobs: 1 (f=1): [R(1)][100.0%][r=55.5MiB/s,w=0KiB/s][r=14.2k,w=0 IOPS][eta 00m:00s] >> reader: (groupid=0, jobs=1): err= 0: pid=1750: Tue Nov 6 20:18:06 2018 >> read: IOPS=13.9k, BW=54.4MiB/s (57.0MB/s)(1024MiB/18826msec) > > ~14000 4KB read IOPS is really, really low for an NVMe disk. Yes, but Jan Kara's original config file for fio is *intended* to highlight the get_user_pages/put_user_pages changes. It was *not* intended to get max performance, as you can see by the numjobs and direct IO parameters: cat fio.conf [reader] direct=1 ioengine=libaio blocksize=4096 size=1g numjobs=1 rw=read iodepth=64 So I'm thinking that this is not a "tainted" test, but rather, we're constraining things a lot with these choices. It's hard to find a good test config to run that allows decisions, but so far, I'm not really seeing anything that says "this is so bad that we can't afford to fix the brokenness." I think. After talking with you and reading this email, I did a bunch more test runs, varying the following fio parameters: -- direct -- numjobs -- iodepth ...with both the baseline 4.20-rc3 kernel, and with my patches applied. 
(btw, if anyone cares, I'll post a github link that has a complete, testable patchset--not ready for submission as such, but it works cleanly and will allow others to attempt to reproduce my results). What I'm seeing is that I can get 10x or better improvements in IOPS and BW, just by going to 10 threads and turning off direct IO--as expected. So in the end, I increased the number of threads, and also increased iodepth a bit. Test results below... > >> cpu : usr=2.39%, sys=95.30%, ctx=669, majf=0, minf=72 > > CPU is obviously the limiting factor. At these IOPS, it should be far > less. >> ------------------------------------------------------ >> OR, here's a better run WITH the patch applied, and you can see that this is nearly as good >> as the "without" case: >> ------------------------------------------------------ >> >> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64 >> fio-3.3 >> Starting 1 process >> Jobs: 1 (f=1): [R(1)][100.0%][r=53.2MiB/s,w=0KiB/s][r=13.6k,w=0 IOPS][eta 00m:00s] >> reader: (groupid=0, jobs=1): err= 0: pid=2521: Tue Nov 6 20:01:33 2018 >> read: IOPS=13.4k, BW=52.5MiB/s (55.1MB/s)(1024MiB/19499msec) > > Similar low IOPS. > >> cpu : usr=3.47%, sys=94.61%, ctx=370, majf=0, minf=73 > > Similar CPU saturation. > >> > > I get nearly 400,000 4KB IOPS on my tiny desktop, which has a 25W > i7-7500 and a Samsung PM961 128GB NVMe (stock Bionic 4.15 kernel > and fio version 3.1). Even then, the CPU saturates, so it's not > necessarily a perfect test. I'd like to see your runs both get to > "max" IOPS, i.e. CPU < 100%, and compare the CPU numbers. This would > give the best comparison for making a decision. I can get to CPU < 100% by increasing to 10 or 20 threads, although it makes latency ever so much worse. > > Can you confirm what type of hardware you're running this test on? > CPU, memory speed and capacity, and NVMe device especially? > > Tom. 
Yes, it's a nice new system, I don't expect any strange perf problems: CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz (Intel X299 chipset) Block device: nvme-Samsung_SSD_970_EVO_250GB DRAM: 32 GB So, here's a comparison using 20 threads, direct IO, for the baseline vs. patched kernel (below). Highlights: -- IOPS are similar, around 60k. -- BW gets worse, dropping from 290 to 220 MB/s. -- CPU is well under 100%. -- latency is incredibly long, but...20 threads. Baseline: $ ./run.sh fio configuration: [reader] ioengine=libaio blocksize=4096 size=1g rw=read group_reporting iodepth=256 direct=1 numjobs=20 -------- Running fio: reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256 ... fio-3.3 Starting 20 processes Jobs: 4 (f=4): [_(8),R(2),_(2),R(1),_(1),R(1),_(5)][95.9%][r=244MiB/s,w=0KiB/s][r=62.5k,w=0 IOPS][eta 00m:03s] reader: (groupid=0, jobs=20): err= 0: pid=14499: Tue Nov 20 16:20:35 2018 read: IOPS=74.2k, BW=290MiB/s (304MB/s)(20.0GiB/70644msec) slat (usec): min=26, max=48167, avg=249.27, stdev=1200.02 clat (usec): min=42, max=147792, avg=67108.56, stdev=18062.46 lat (usec): min=103, max=147943, avg=67358.10, stdev=18109.75 clat percentiles (msec): | 1.00th=[ 21], 5.00th=[ 40], 10.00th=[ 41], 20.00th=[ 47], | 30.00th=[ 58], 40.00th=[ 65], 50.00th=[ 70], 60.00th=[ 75], | 70.00th=[ 79], 80.00th=[ 83], 90.00th=[ 89], 95.00th=[ 93], | 99.00th=[ 104], 99.50th=[ 109], 99.90th=[ 121], 99.95th=[ 125], | 99.99th=[ 134] bw ( KiB/s): min= 9712, max=46362, per=5.11%, avg=15164.99, stdev=2242.15, samples=2742 iops : min= 2428, max=11590, avg=3790.94, stdev=560.53, samples=2742 lat (usec) : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.98%, 50=20.44% lat (msec) : 100=76.95%, 250=1.61% cpu : usr=1.00%, sys=57.65%, ctx=158367, majf=0, minf=5284 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1% issued rwts: total=5242880,0,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=256 Run status group 0 (all jobs): READ: bw=290MiB/s (304MB/s), 290MiB/s-290MiB/s (304MB/s-304MB/s), io=20.0GiB (21.5GB), run=70644-70644msec Disk stats (read/write): nvme0n1: ios=5240738/7, merge=0/7, ticks=1457727/5, in_queue=1547139, util=100.00% -------------------------------------------------------------- Patched: <redforge> fast_256GB $ ./run.sh fio configuration: [reader] ioengine=libaio blocksize=4096 size=1g rw=read group_reporting iodepth=256 direct=1 numjobs=20 -------- Running fio: reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256 ... fio-3.3 Starting 20 processes Jobs: 13 (f=8): [_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0 IOPS][eta 00m:02s] reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018 read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec) slat (usec): min=26, max=50436, avg=337.21, stdev=1405.14 clat (usec): min=43, max=178839, avg=88963.96, stdev=21745.31 lat (usec): min=106, max=179041, avg=89301.43, stdev=21800.43 clat percentiles (msec): | 1.00th=[ 50], 5.00th=[ 53], 10.00th=[ 55], 20.00th=[ 68], | 30.00th=[ 79], 40.00th=[ 86], 50.00th=[ 93], 60.00th=[ 99], | 70.00th=[ 103], 80.00th=[ 108], 90.00th=[ 114], 95.00th=[ 121], | 99.00th=[ 134], 99.50th=[ 140], 99.90th=[ 150], 99.95th=[ 155], | 99.99th=[ 163] bw ( KiB/s): min= 4920, max=39733, per=5.07%, avg=11506.18, stdev=1540.18, samples=3650 iops : min= 1230, max= 9933, avg=2876.20, stdev=385.05, samples=3650 lat (usec) : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01% lat (usec) : 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.23%, 50=1.13% lat (msec) : 100=63.04%, 250=35.57% cpu : usr=0.65%, sys=58.07%, 
ctx=188963, majf=0, minf=5303 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1% issued rwts: total=5242880,0,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=256 Run status group 0 (all jobs): READ: bw=222MiB/s (232MB/s), 222MiB/s-222MiB/s (232MB/s-232MB/s), io=20.0GiB (21.5GB), run=92385-92385msec Disk stats (read/write): nvme0n1: ios=5240550/7, merge=0/7, ticks=1513681/4, in_queue=1636411, util=100.00% Thoughts? thanks,
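The very long latencies in the 20-job runs are what Little's law predicts for this queue depth: average latency is roughly the number of outstanding I/Os divided by the completion rate. A minimal sketch, assuming every job keeps its full iodepth of 256 in flight for the whole run (only approximately true, since jobs finish at different times); the function name is hypothetical.

```python
def expected_latency_ms(numjobs, iodepth, iops):
    """Little's law: latency = outstanding requests / throughput, in ms."""
    outstanding = numjobs * iodepth  # 20 * 256 = 5120 I/Os in flight
    return outstanding / iops * 1000.0


# IOPS taken from the two fio summary lines in the message above.
baseline = expected_latency_ms(20, 256, 74_200)  # fio avg lat: 67358.10 usec
patched = expected_latency_ms(20, 256, 56_800)   # fio avg lat: 89301.43 usec

print(f"baseline: {baseline:.1f} ms expected")
print(f"patched:  {patched:.1f} ms expected")
```

Both estimates land within a few percent of the average `lat` values fio printed, which suggests the latency is dominated by queuing in the benchmark configuration.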
On 11/21/2018 1:09 AM, John Hubbard wrote:
> On 11/19/18 10:57 AM, Tom Talpey wrote:
>> ~14000 4KB read IOPS is really, really low for an NVMe disk.
>
> Yes, but Jan Kara's original config file for fio is *intended* to highlight
> the get_user_pages/put_user_pages changes. It was *not* intended to get max
> performance, as you can see by the numjobs and direct IO parameters:
>
> cat fio.conf
> [reader]
> direct=1
> ioengine=libaio
> blocksize=4096
> size=1g
> numjobs=1
> rw=read
> iodepth=64

To be clear - I used those identical parameters, on my lower-spec
machine, and got 400,000 4KB read IOPS. Those results are nearly 30x
higher than yours!

> So I'm thinking that this is not a "tainted" test, but rather, we're constraining
> things a lot with these choices. It's hard to find a good test config to run that
> allows decisions, but so far, I'm not really seeing anything that says "this
> is so bad that we can't afford to fix the brokenness." I think.

I'm not suggesting we tune the benchmark, I'm suggesting the results
on your system are not meaningful since they are orders of magnitude
low. And without meaningful data it's impossible to see the performance
impact of the change...

>> Can you confirm what type of hardware you're running this test on?
>> CPU, memory speed and capacity, and NVMe device especially?
>>
>> Tom.
>
> Yes, it's a nice new system, I don't expect any strange perf problems:
>
> CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
>      (Intel X299 chipset)
> Block device: nvme-Samsung_SSD_970_EVO_250GB
> DRAM: 32 GB

The Samsung Evo 970 250GB is speced to yield 200,000 random read IOPS
with a 4KB QD32 workload:

https://www.samsung.com/us/computing/memory-storage/solid-state-drives/ssd-970-evo-nvme-m-2-250gb-mz-v7e250bw/#specs

And the I7-7800X is a 6-core processor (12 hyperthreads).

> So, here's a comparison using 20 threads, direct IO, for the baseline vs.
> patched kernel (below). Highlights:
>
> -- IOPS are similar, around 60k.
> -- BW gets worse, dropping from 290 to 220 MB/s.
> -- CPU is well under 100%.
> -- latency is incredibly long, but...20 threads.
>
> Baseline:
>
> $ ./run.sh
> fio configuration:
> [reader]
> ioengine=libaio
> blocksize=4096
> size=1g
> rw=read
> group_reporting
> iodepth=256
> direct=1
> numjobs=20

Ouch - 20 threads issuing 256 io's each!? Of course latency skyrockets.
That's going to cause tremendous queuing, and context switching, far
outside of the get_user_pages() change.

But even so, it only brings IOPS to 74.2K, which is still far short of
the device's 200K spec.

Comparing anyway:

> Patched:
>
> -------- Running fio:
> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
> ...
> fio-3.3
> Starting 20 processes
> Jobs: 13 (f=8): [_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0 IOPS][eta 00m:02s]
> reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018
>    read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec)
> ...
> Thoughts?

Concern - the 74.2K IOPS unpatched drops to 56.8K patched!

What I'd really like to see is to go back to the original fio parameters
(1 thread, 64 iodepth) and try to get a result that gets at least close
to the speced 200K IOPS of the NVMe device. There seems to be something
wrong with yours, currently.

Then of course, the result with the patched get_user_pages, and
compare whichever of IOPS or CPU% changes, and how much.

If these are within a few percent, I agree it's good to go. If it's
roughly 25% like the result just above, that's a rocky road.

I can try this after the holiday on some basic hardware and might
be able to scrounge up better. Can you post that github link?

Tom.
On 11/21/18 8:49 AM, Tom Talpey wrote: > On 11/21/2018 1:09 AM, John Hubbard wrote: >> On 11/19/18 10:57 AM, Tom Talpey wrote: >>> ~14000 4KB read IOPS is really, really low for an NVMe disk. >> >> Yes, but Jan Kara's original config file for fio is *intended* to highlight >> the get_user_pages/put_user_pages changes. It was *not* intended to get max >> performance, as you can see by the numjobs and direct IO parameters: >> >> cat fio.conf >> [reader] >> direct=1 >> ioengine=libaio >> blocksize=4096 >> size=1g >> numjobs=1 >> rw=read >> iodepth=64 > > To be clear - I used those identical parameters, on my lower-spec > machine, and got 400,000 4KB read IOPS. Those results are nearly 30x > higher than yours! OK, then something really is wrong here... > >> So I'm thinking that this is not a "tainted" test, but rather, we're constraining >> things a lot with these choices. It's hard to find a good test config to run that >> allows decisions, but so far, I'm not really seeing anything that says "this >> is so bad that we can't afford to fix the brokenness." I think. > > I'm not suggesting we tune the benchmark, I'm suggesting the results > on your system are not meaningful since they are orders of magnitude > low. And without meaningful data it's impossible to see the performance > impact of the change... > >>> Can you confirm what type of hardware you're running this test on? >>> CPU, memory speed and capacity, and NVMe device especially? >>> >>> Tom. >> >> Yes, it's a nice new system, I don't expect any strange perf problems: >> >> CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz >> (Intel X299 chipset) >> Block device: nvme-Samsung_SSD_970_EVO_250GB >> DRAM: 32 GB > > The Samsung Evo 970 250GB is speced to yield 200,000 random read IOPS > with a 4KB QD32 workload: > > > https://www.samsung.com/us/computing/memory-storage/solid-state-drives/ssd-970-evo-nvme-m-2-250gb-mz-v7e250bw/#specs > > And the I7-7800X is a 6-core processor (12 hyperthreads). 
> >> So, here's a comparison using 20 threads, direct IO, for the baseline vs. >> patched kernel (below). Highlights: >> >> -- IOPS are similar, around 60k. >> -- BW gets worse, dropping from 290 to 220 MB/s. >> -- CPU is well under 100%. >> -- latency is incredibly long, but...20 threads. >> >> Baseline: >> >> $ ./run.sh >> fio configuration: >> [reader] >> ioengine=libaio >> blocksize=4096 >> size=1g >> rw=read >> group_reporting >> iodepth=256 >> direct=1 >> numjobs=20 > > Ouch - 20 threads issuing 256 io's each!? Of course latency skyrockets. > That's going to cause tremendous queuing, and context switching, far > outside of the get_user_pages() change. > > But even so, it only brings IOPS to 74.2K, which is still far short of > the device's 200K spec. > > Comparing anyway: > > >> Patched: >> >> -------- Running fio: >> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256 >> ... >> fio-3.3 >> Starting 20 processes >> Jobs: 13 (f=8): [_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0 IOPS][eta 00m:02s] >> reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018 >> read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec) >> ... >> Thoughts? > > Concern - the 74.2K IOPS unpatched drops to 56.8K patched! ACK. :) > > What I'd really like to see is to go back to the original fio parameters > (1 thread, 64 iodepth) and try to get a result that gets at least close > to the speced 200K IOPS of the NVMe device. There seems to be something > wrong with yours, currently. I'll dig into what has gone wrong with the test. I see fio putting data files in the right place, so the obvious "using the wrong drive" is (probably) not it. Even though it really feels like that sort of thing. We'll see. > > Then of course, the result with the patched get_user_pages, and > compare whichever of IOPS or CPU% changes, and how much. 
> > If these are within a few percent, I agree it's good to go. If it's > roughly 25% like the result just above, that's a rocky road. > > I can try this after the holiday on some basic hardware and might > be able to scrounge up better. Can you post that github link? > Here: git@github.com:johnhubbard/linux (branch: gup_dma_testing)
On 11/21/2018 5:06 PM, John Hubbard wrote: > On 11/21/18 8:49 AM, Tom Talpey wrote: >> On 11/21/2018 1:09 AM, John Hubbard wrote: >>> On 11/19/18 10:57 AM, Tom Talpey wrote: >>>> ~14000 4KB read IOPS is really, really low for an NVMe disk. >>> >>> Yes, but Jan Kara's original config file for fio is *intended* to highlight >>> the get_user_pages/put_user_pages changes. It was *not* intended to get max >>> performance, as you can see by the numjobs and direct IO parameters: >>> >>> cat fio.conf >>> [reader] >>> direct=1 >>> ioengine=libaio >>> blocksize=4096 >>> size=1g >>> numjobs=1 >>> rw=read >>> iodepth=64 >> >> To be clear - I used those identical parameters, on my lower-spec >> machine, and got 400,000 4KB read IOPS. Those results are nearly 30x >> higher than yours! > > OK, then something really is wrong here... > >> >>> So I'm thinking that this is not a "tainted" test, but rather, we're constraining >>> things a lot with these choices. It's hard to find a good test config to run that >>> allows decisions, but so far, I'm not really seeing anything that says "this >>> is so bad that we can't afford to fix the brokenness." I think. >> >> I'm not suggesting we tune the benchmark, I'm suggesting the results >> on your system are not meaningful since they are orders of magnitude >> low. And without meaningful data it's impossible to see the performance >> impact of the change... >> >>>> Can you confirm what type of hardware you're running this test on? >>>> CPU, memory speed and capacity, and NVMe device especially? >>>> >>>> Tom. 
>>> >>> Yes, it's a nice new system, I don't expect any strange perf problems: >>> >>> CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz >>> (Intel X299 chipset) >>> Block device: nvme-Samsung_SSD_970_EVO_250GB >>> DRAM: 32 GB >> >> The Samsung Evo 970 250GB is speced to yield 200,000 random read IOPS >> with a 4KB QD32 workload: >> >> >> https://www.samsung.com/us/computing/memory-storage/solid-state-drives/ssd-970-evo-nvme-m-2-250gb-mz-v7e250bw/#specs >> >> And the I7-7800X is a 6-core processor (12 hyperthreads). >> >>> So, here's a comparison using 20 threads, direct IO, for the baseline vs. >>> patched kernel (below). Highlights: >>> >>> -- IOPS are similar, around 60k. >>> -- BW gets worse, dropping from 290 to 220 MB/s. >>> -- CPU is well under 100%. >>> -- latency is incredibly long, but...20 threads. >>> >>> Baseline: >>> >>> $ ./run.sh >>> fio configuration: >>> [reader] >>> ioengine=libaio >>> blocksize=4096 >>> size=1g >>> rw=read >>> group_reporting >>> iodepth=256 >>> direct=1 >>> numjobs=20 >> >> Ouch - 20 threads issuing 256 io's each!? Of course latency skyrockets. >> That's going to cause tremendous queuing, and context switching, far >> outside of the get_user_pages() change. >> >> But even so, it only brings IOPS to 74.2K, which is still far short of >> the device's 200K spec. >> >> Comparing anyway: >> >> >>> Patched: >>> >>> -------- Running fio: >>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256 >>> ... >>> fio-3.3 >>> Starting 20 processes >>> Jobs: 13 (f=8): [_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0 IOPS][eta 00m:02s] >>> reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018 >>> read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec) >>> ... >>> Thoughts? >> >> Concern - the 74.2K IOPS unpatched drops to 56.8K patched! > > ACK. 
:) > >> >> What I'd really like to see is to go back to the original fio parameters >> (1 thread, 64 iodepth) and try to get a result that gets at least close >> to the speced 200K IOPS of the NVMe device. There seems to be something >> wrong with yours, currently. > > I'll dig into what has gone wrong with the test. I see fio putting data files > in the right place, so the obvious "using the wrong drive" is (probably) > not it. Even though it really feels like that sort of thing. We'll see. > >> >> Then of course, the result with the patched get_user_pages, and >> compare whichever of IOPS or CPU% changes, and how much. >> >> If these are within a few percent, I agree it's good to go. If it's >> roughly 25% like the result just above, that's a rocky road. >> >> I can try this after the holiday on some basic hardware and might >> be able to scrounge up better. Can you post that github link? >> > > Here: > > git@github.com:johnhubbard/linux (branch: gup_dma_testing) I'm super-limited here this week hardware-wise and have not been able to try testing with the patched kernel. I was able to compare my earlier quick test with a Bionic 4.15 kernel (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick test, and without your change. Say, that branch reports it has not had a commit since June 30. Is that the right one? What about gup_dma_for_lpc_2018? Tom.
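The "within a few percent" versus "roughly 25%" criterion discussed in this message comes down to a relative-change calculation. A minimal sketch using only IOPS figures quoted in the thread; the helper name and the dictionary labels are hypothetical.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent (negative = drop)."""
    return (after - before) / before * 100.0


# IOPS pairs quoted in the thread: (before, after).
runs = {
    "20 jobs, baseline -> patched": (74_200, 56_800),
    "Tom's desktop, 4.15 -> 4.20-rc3": (400_000, 375_000),
}

for label, (before, after) in runs.items():
    print(f"{label}: {percent_change(before, after):+.2f}%")

# How far the first single-job result sits below Tom's desktop figure.
print(f"desktop vs. first run: {400_000 / 13_900:.1f}x")
```

The first pair works out to about a 23% drop, which is the "roughly 25%" case; the second is 6.25%; and the ratio printed last is the "nearly 30x" gap mentioned earlier in the thread.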
On 11/27/18 5:21 PM, Tom Talpey wrote: > On 11/21/2018 5:06 PM, John Hubbard wrote: >> On 11/21/18 8:49 AM, Tom Talpey wrote: >>> On 11/21/2018 1:09 AM, John Hubbard wrote: >>>> On 11/19/18 10:57 AM, Tom Talpey wrote: [...] >>> >>> What I'd really like to see is to go back to the original fio parameters >>> (1 thread, 64 iodepth) and try to get a result that gets at least close >>> to the speced 200K IOPS of the NVMe device. There seems to be something >>> wrong with yours, currently. >> >> I'll dig into what has gone wrong with the test. I see fio putting data files >> in the right place, so the obvious "using the wrong drive" is (probably) >> not it. Even though it really feels like that sort of thing. We'll see. >> >>> >>> Then of course, the result with the patched get_user_pages, and >>> compare whichever of IOPS or CPU% changes, and how much. >>> >>> If these are within a few percent, I agree it's good to go. If it's >>> roughly 25% like the result just above, that's a rocky road. >>> >>> I can try this after the holiday on some basic hardware and might >>> be able to scrounge up better. Can you post that github link? >>> >> >> Here: >> >> git@github.com:johnhubbard/linux (branch: gup_dma_testing) > > I'm super-limited here this week hardware-wise and have not been able > to try testing with the patched kernel. > > I was able to compare my earlier quick test with a Bionic 4.15 kernel > (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to > ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick > test, and without your change. > So just to double check (again): you are running fio with these parameters, right? [reader] direct=1 ioengine=libaio blocksize=4096 size=1g numjobs=1 rw=read iodepth=64 > Say, that branch reports it has not had a commit since June 30. Is that > the right one? What about gup_dma_for_lpc_2018? > That's the right branch, but the AuthorDate for the head commit (only) somehow got stuck in the past. 
I just now amended that patch with a new date and pushed it, so the head
commit now shows Nov 27:

https://github.com/johnhubbard/linux/commits/gup_dma_testing

The actual code is the same, though. (It is still based on Nov 19th's
f2ce1065e767 commit.)

thanks,
On 11/27/2018 9:52 PM, John Hubbard wrote: > On 11/27/18 5:21 PM, Tom Talpey wrote: >> On 11/21/2018 5:06 PM, John Hubbard wrote: >>> On 11/21/18 8:49 AM, Tom Talpey wrote: >>>> On 11/21/2018 1:09 AM, John Hubbard wrote: >>>>> On 11/19/18 10:57 AM, Tom Talpey wrote: > [...] >>>> >>>> What I'd really like to see is to go back to the original fio parameters >>>> (1 thread, 64 iodepth) and try to get a result that gets at least close >>>> to the speced 200K IOPS of the NVMe device. There seems to be something >>>> wrong with yours, currently. >>> >>> I'll dig into what has gone wrong with the test. I see fio putting data files >>> in the right place, so the obvious "using the wrong drive" is (probably) >>> not it. Even though it really feels like that sort of thing. We'll see. >>> >>>> >>>> Then of course, the result with the patched get_user_pages, and >>>> compare whichever of IOPS or CPU% changes, and how much. >>>> >>>> If these are within a few percent, I agree it's good to go. If it's >>>> roughly 25% like the result just above, that's a rocky road. >>>> >>>> I can try this after the holiday on some basic hardware and might >>>> be able to scrounge up better. Can you post that github link? >>>> >>> >>> Here: >>> >>> git@github.com:johnhubbard/linux (branch: gup_dma_testing) >> >> I'm super-limited here this week hardware-wise and have not been able >> to try testing with the patched kernel. >> >> I was able to compare my earlier quick test with a Bionic 4.15 kernel >> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to >> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick >> test, and without your change. >> > > So just to double check (again): you are running fio with these parameters, > right? > > [reader] > direct=1 > ioengine=libaio > blocksize=4096 > size=1g > numjobs=1 > rw=read > iodepth=64 Correct, I copy/pasted these directly. I also ran with size=10g because the 1g provides a really small sample set. 
There was one other difference, your results indicated fio 3.3 was used.
My Bionic install has fio 3.1. I don't find that relevant because our
goal is to compare before/after, which I haven't done yet.

Tom.

>
>> Say, that branch reports it has not had a commit since June 30. Is that
>> the right one? What about gup_dma_for_lpc_2018?
>>
>
> That's the right branch, but the AuthorDate for the head commit (only) somehow
> got stuck in the past. I just now amended that patch with a new date and pushed
> it, so the head commit now shows Nov 27:
>
> https://github.com/johnhubbard/linux/commits/gup_dma_testing
>
> The actual code is the same, though. (It is still based on Nov 19th's f2ce1065e767
> commit.)
>
> thanks,
>
On 11/28/18 5:59 AM, Tom Talpey wrote:
> On 11/27/2018 9:52 PM, John Hubbard wrote:
>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>> [...]
>>> I'm super-limited here this week hardware-wise and have not been able
>>> to try testing with the patched kernel.
>>>
>>> I was able to compare my earlier quick test with a Bionic 4.15 kernel
>>> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
>>> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
>>> test, and without your change.
>>>
>>
>> So just to double check (again): you are running fio with these parameters,
>> right?
>>
>> [reader]
>> direct=1
>> ioengine=libaio
>> blocksize=4096
>> size=1g
>> numjobs=1
>> rw=read
>> iodepth=64
>
> Correct, I copy/pasted these directly. I also ran with size=10g because
> the 1g provides a really small sample set.
>
> There was one other difference, your results indicated fio 3.3 was used.
> My Bionic install has fio 3.1. I don't find that relevant because our
> goal is to compare before/after, which I haven't done yet.
>

OK, the 50 MB/s was due to my particular .config. I had some expensive debug options
set in mm, fs and locking subsystems. Turning those off, I'm back up to the rated
speed of the Samsung NVMe device, so now we should have a clearer picture of the
performance that real users will see.
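One way to catch this kind of problem before benchmarking is to list the debug options enabled in the kernel `.config`. The sketch below is an illustration only: the thread does not say which options were set, so the patterns and the sample config text are assumptions, and the match is deliberately broad (it also flags cheap options such as debug info).

```python
import re

# Hypothetical patterns; the thread does not name the options involved.
DEBUG_OPTION_RE = re.compile(r"^CONFIG_(DEBUG_\w+|PROVE_\w+|LOCKDEP|KASAN)=(y|m)$")


def enabled_debug_options(config_text):
    """Return the names of enabled options that match DEBUG_OPTION_RE."""
    found = []
    for line in config_text.splitlines():
        line = line.strip()
        if DEBUG_OPTION_RE.match(line):
            found.append(line.split("=", 1)[0])
    return found


# Made-up sample for illustration, not John's actual configuration.
SAMPLE_CONFIG = """\
CONFIG_DEBUG_VM=y
# CONFIG_DEBUG_PAGEALLOC is not set
CONFIG_PROVE_LOCKING=y
CONFIG_NVME_CORE=y
"""

print(enabled_debug_options(SAMPLE_CONFIG))
```

On a real tree the same function could be fed the contents of the build's `.config` file; deciding which of the reported options are actually expensive still needs human judgment.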
Continuing on, then: running a before and after test, I don't see any significant
difference in the fio results:

fio.conf:

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64

---------------------------------------------------------
Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:

$ fio ./experimental-fio.conf
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018
   read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
    slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46
    clat (usec): min=162, max=12247, avg=330.00, stdev=185.55
     lat (usec): min=165, max=12253, avg=331.68, stdev=185.69
    clat percentiles (usec):
     | 1.00th=[ 322], 5.00th=[ 326], 10.00th=[ 326], 20.00th=[ 326],
     | 30.00th=[ 326], 40.00th=[ 326], 50.00th=[ 326], 60.00th=[ 326],
     | 70.00th=[ 326], 80.00th=[ 326], 90.00th=[ 326], 95.00th=[ 326],
     | 99.00th=[ 379], 99.50th=[ 594], 99.90th=[ 603], 99.95th=[ 611],
     | 99.99th=[12125]
   bw ( KiB/s): min=751640, max=782912, per=99.52%, avg=767276.00, stdev=22112.64, samples=2
   iops : min=187910, max=195728, avg=191819.00, stdev=5528.16, samples=2
  lat (usec) : 250=0.08%, 500=99.30%, 750=0.59%
  lat (msec) : 20=0.02%
  cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), io=1024MiB (1074MB), run=1360-1360msec

Disk stats (read/write):
  nvme0n1: ios=220798/0, merge=0/0, ticks=71481/0, in_queue=71966, util=100.00%

---------------------------------------------------------
With patches applied:

<redforge> fast_256GB $ fio ./experimental-fio.conf
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018
   read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
    slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46
    clat (usec): min=162, max=12247, avg=330.00, stdev=185.55
     lat (usec): min=165, max=12253, avg=331.68, stdev=185.69
    clat percentiles (usec):
     | 1.00th=[ 322], 5.00th=[ 326], 10.00th=[ 326], 20.00th=[ 326],
     | 30.00th=[ 326], 40.00th=[ 326], 50.00th=[ 326], 60.00th=[ 326],
     | 70.00th=[ 326], 80.00th=[ 326], 90.00th=[ 326], 95.00th=[ 326],
     | 99.00th=[ 379], 99.50th=[ 594], 99.90th=[ 603], 99.95th=[ 611],
     | 99.99th=[12125]
   bw ( KiB/s): min=751640, max=782912, per=99.52%, avg=767276.00, stdev=22112.64, samples=2
   iops : min=187910, max=195728, avg=191819.00, stdev=5528.16, samples=2
  lat (usec) : 250=0.08%, 500=99.30%, 750=0.59%
  lat (msec) : 20=0.02%
  cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), io=1024MiB (1074MB), run=1360-1360msec

Disk stats (read/write):
  nvme0n1: ios=220798/0, merge=0/0, ticks=71481/0, in_queue=71966, util=100.00%

thanks,
On 11/29/2018 8:39 PM, John Hubbard wrote:
> On 11/28/18 5:59 AM, Tom Talpey wrote:
>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>> [...]
>>>> I'm super-limited here this week hardware-wise and have not been able
>>>> to try testing with the patched kernel.
>>>>
>>>> I was able to compare my earlier quick test with a Bionic 4.15 kernel
>>>> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
>>>> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
>>>> test, and without your change.
>>>>
>>>
>>> So just to double check (again): you are running fio with these parameters,
>>> right?
>>>
>>> [reader]
>>> direct=1
>>> ioengine=libaio
>>> blocksize=4096
>>> size=1g
>>> numjobs=1
>>> rw=read
>>> iodepth=64
>>
>> Correct, I copy/pasted these directly. I also ran with size=10g because
>> the 1g provides a really small sample set.
>>
>> There was one other difference, your results indicated fio 3.3 was used.
>> My Bionic install has fio 3.1. I don't find that relevant because our
>> goal is to compare before/after, which I haven't done yet.
>>
>
> OK, the 50 MB/s was due to my particular .config. I had some expensive debug options
> set in mm, fs and locking subsystems. Turning those off, I'm back up to the rated
> speed of the Samsung NVMe device, so now we should have a clearer picture of the
> performance that real users will see.

Oh, good! I'm especially glad because I was having a heck of a time
reconfiguring the one machine I have available for this.
> Continuing on, then: running a before and after test, I don't see any significant
> difference in the fio results:

Excerpting from below:

> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
> cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73

vs

> With patches applied:
> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
> cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73

Perfect results, not CPU limited, and full IOPS.

Curiously identical, so I trust you've checked that you measured
both targets, but if so, I say it's good.

Tom.

> [...]
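The "curiously identical" check can also be done mechanically. What follows is a minimal sketch, not anything fio provides: it assumes the per-job header line format seen in this thread (`... pid=N: <timestamp>`), and the helper names `run_identity` and `looks_like_same_run` are made up for illustration. Two postings that carry the same pid and wall-clock timestamp are very likely one run pasted twice. The third header is the corrected baseline that appears later in the thread.

```python
import re

# Matches the tail of the per-job header line as it appears in this thread, e.g.
# "reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018"
HEADER_RE = re.compile(r"pid=(\d+):\s+(.+?)\s*$")


def run_identity(header_line):
    """Return (pid, timestamp) from a fio job header line (hypothetical helper)."""
    match = HEADER_RE.search(header_line)
    if match is None:
        raise ValueError("not a fio job header line: %r" % header_line)
    return int(match.group(1)), match.group(2)


def looks_like_same_run(header_a, header_b):
    """True if both headers carry the same pid and the same timestamp."""
    return run_identity(header_a) == run_identity(header_b)


# Headers copied from the thread: the two runs as first posted, and the
# corrected baseline posted afterwards.
baseline = "reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018"
patched = "reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018"
corrected = "reader: (groupid=0, jobs=1): err= 0: pid=1715: Thu Nov 29 17:07:09 2018"

print("as first posted, same run?", looks_like_same_run(baseline, patched))
print("corrected baseline, same run?", looks_like_same_run(corrected, patched))
```

A matching pid alone would not be conclusive, since pids can repeat across reboots; pid plus timestamp is a much stronger hint.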
On 11/29/18 6:18 PM, Tom Talpey wrote:
> On 11/29/2018 8:39 PM, John Hubbard wrote:
>> On 11/28/18 5:59 AM, Tom Talpey wrote:
>>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>> [...]
>>>>> I'm super-limited here this week hardware-wise and have not been able
>>>>> to try testing with the patched kernel.
>>>>>
>>>>> I was able to compare my earlier quick test with a Bionic 4.15 kernel
>>>>> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
>>>>> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
>>>>> test, and without your change.
>>>>>
>>>>
>>>> So just to double check (again): you are running fio with these parameters,
>>>> right?
>>>>
>>>> [reader]
>>>> direct=1
>>>> ioengine=libaio
>>>> blocksize=4096
>>>> size=1g
>>>> numjobs=1
>>>> rw=read
>>>> iodepth=64
>>>
>>> Correct, I copy/pasted these directly. I also ran with size=10g because
>>> the 1g provides a really small sample set.
>>>
>>> There was one other difference, your results indicated fio 3.3 was used.
>>> My Bionic install has fio 3.1. I don't find that relevant because our
>>> goal is to compare before/after, which I haven't done yet.
>>>
>>
>> OK, the 50 MB/s was due to my particular .config. I had some expensive debug options
>> set in mm, fs and locking subsystems. Turning those off, I'm back up to the rated
>> speed of the Samsung NVMe device, so now we should have a clearer picture of the
>> performance that real users will see.
>
> Oh, good! I'm especially glad because I was having a heck of a time
> reconfiguring the one machine I have available for this.
>
>> Continuing on, then: running a before and after test, I don't see any significant
>> difference in the fio results:
>
> Excerpting from below:
>
>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>> cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>
> vs
>
>> With patches applied:
>> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>> cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>
> Perfect results, not CPU limited, and full IOPS.
>
> Curiously identical, so I trust you've checked that you measured
> both targets, but if so, I say it's good.
>

Argh, copy-paste error in the email. The real "before" is ever so slightly
better, at 194K IOPS and 759 MB/s:

$ fio ./experimental-fio.conf
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1715: Thu Nov 29 17:07:09 2018
   read: IOPS=194k, BW=759MiB/s (795MB/s)(1024MiB/1350msec)
    slat (nsec): min=1245, max=2812.7k, avg=1538.03, stdev=5519.61
    clat (usec): min=148, max=755, avg=326.85, stdev=18.13
     lat (usec): min=150, max=3483, avg=328.41, stdev=19.53
    clat percentiles (usec):
     | 1.00th=[ 322], 5.00th=[ 326], 10.00th=[ 326], 20.00th=[ 326],
     | 30.00th=[ 326], 40.00th=[ 326], 50.00th=[ 326], 60.00th=[ 326],
     | 70.00th=[ 326], 80.00th=[ 326], 90.00th=[ 326], 95.00th=[ 326],
     | 99.00th=[ 355], 99.50th=[ 537], 99.90th=[ 553], 99.95th=[ 553],
     | 99.99th=[ 619]
   bw ( KiB/s): min=767816, max=783096, per=99.84%, avg=775456.00, stdev=10804.59, samples=2
   iops : min=191954, max=195774, avg=193864.00, stdev=2701.15, samples=2
  lat (usec) : 250=0.09%, 500=99.30%, 750=0.61%, 1000=0.01%
  cpu : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=759MiB/s (795MB/s), 759MiB/s-759MiB/s (795MB/s-795MB/s), io=1024MiB (1074MB), run=1350-1350msec

Disk stats (read/write):
  nvme0n1: ios=222853/0, merge=0/0, ticks=71410/0, in_queue=71935, util=100.00%

thanks,
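To put the corrected numbers side by side, here is a small sketch that uses only the `iops` summary lines fio printed in this thread. It assumes that the output posted earlier (pid=1738) really is the patched run, as the correction above implies; the function name `relative_drop_percent` is made up for illustration.

```python
def relative_drop_percent(baseline, patched):
    """Percentage by which `patched` falls short of `baseline`."""
    return (baseline - patched) / baseline * 100.0


# Values copied from the fio "iops" lines quoted in this thread (samples=2 each).
baseline_avg, baseline_stdev = 193864.00, 2701.15  # pid=1715, unpatched kernel
patched_avg, patched_stdev = 191819.00, 5528.16    # pid=1738, assumed patched

drop = relative_drop_percent(baseline_avg, patched_avg)
gap = baseline_avg - patched_avg

print("IOPS drop with patches: %.2f%%" % drop)
print("gap of %.0f IOPS vs. stdev %.2f (baseline) / %.2f (patched)"
      % (gap, baseline_stdev, patched_stdev))

# Rough heuristic only: with two samples per run, a gap smaller than either
# standard deviation is hard to tell apart from run-to-run noise.
print("gap within one stdev of both runs:",
      gap < min(baseline_stdev, patched_stdev))
```

By this rough measure the patched run is about 1% slower, which is in the same range as the overhead estimated from profiling in the cover letter, but two samples per run are too few to draw a firm conclusion. Running with a larger `size=` (as suggested earlier in the thread) would give more samples.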
On 11/29/2018 9:21 PM, John Hubbard wrote:
> On 11/29/18 6:18 PM, Tom Talpey wrote:
>> On 11/29/2018 8:39 PM, John Hubbard wrote:
>>> On 11/28/18 5:59 AM, Tom Talpey wrote:
>>>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>>> [...]
>>>>>> I'm super-limited here this week hardware-wise and have not been able
>>>>>> to try testing with the patched kernel.
>>>>>>
>>>>>> I was able to compare my earlier quick test with a Bionic 4.15 kernel
>>>>>> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
>>>>>> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
>>>>>> test, and without your change.
>>>>>>
>>>>>
>>>>> So just to double check (again): you are running fio with these parameters,
>>>>> right?
>>>>>
>>>>> [reader]
>>>>> direct=1
>>>>> ioengine=libaio
>>>>> blocksize=4096
>>>>> size=1g
>>>>> numjobs=1
>>>>> rw=read
>>>>> iodepth=64
>>>>
>>>> Correct, I copy/pasted these directly. I also ran with size=10g because
>>>> the 1g provides a really small sample set.
>>>>
>>>> There was one other difference, your results indicated fio 3.3 was used.
>>>> My Bionic install has fio 3.1. I don't find that relevant because our
>>>> goal is to compare before/after, which I haven't done yet.
>>>>
>>>
>>> OK, the 50 MB/s was due to my particular .config. I had some expensive debug options
>>> set in mm, fs and locking subsystems. Turning those off, I'm back up to the rated
>>> speed of the Samsung NVMe device, so now we should have a clearer picture of the
>>> performance that real users will see.
>>
>> Oh, good! I'm especially glad because I was having a heck of a time
>> reconfiguring the one machine I have available for this.
>>
>>> Continuing on, then: running a before and after test, I don't see any significant
>>> difference in the fio results:
>>
>> Excerpting from below:
>>
>>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>>> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>> cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>
>> vs
>>
>>> With patches applied:
>>> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>> cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>
>> Perfect results, not CPU limited, and full IOPS.
>>
>> Curiously identical, so I trust you've checked that you measured
>> both targets, but if so, I say it's good.
>>
>
> Argh, copy-paste error in the email. The real "before" is ever so slightly
> better, at 194K IOPS and 759 MB/s:

Definitely better - note the system CPU is lower, which is probably the
reason for the increased IOPS.

> cpu : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73

Good result - a correct implementation, and faster.

Tom.
On 11/29/18 6:30 PM, Tom Talpey wrote:
> On 11/29/2018 9:21 PM, John Hubbard wrote:
>> On 11/29/18 6:18 PM, Tom Talpey wrote:
>>> On 11/29/2018 8:39 PM, John Hubbard wrote:
>>>> On 11/28/18 5:59 AM, Tom Talpey wrote:
>>>>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>>>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>>>> [...]
>>> Excerpting from below:
>>>
>>>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>>>> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>> cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>>
>>> vs
>>>
>>>> With patches applied:
>>>> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>> cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>>
>>> Perfect results, not CPU limited, and full IOPS.
>>>
>>> Curiously identical, so I trust you've checked that you measured
>>> both targets, but if so, I say it's good.
>>>
>>
>> Argh, copy-paste error in the email. The real "before" is ever so slightly
>> better, at 194K IOPS and 759 MB/s:
>
> Definitely better - note the system CPU is lower, which is probably the
> reason for the increased IOPS.
>
>> cpu : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73
>
> Good result - a correct implementation, and faster.
>

Thanks, Tom, I really appreciate your experience and help on what performance
should look like here. (I'm sure you can guess that this is the first time
I've worked with fio, heh.)

I'll send out a new, non-RFC patchset soon, then.

thanks,
On 11/29/2018 10:00 PM, John Hubbard wrote:
> On 11/29/18 6:30 PM, Tom Talpey wrote:
>> On 11/29/2018 9:21 PM, John Hubbard wrote:
>>> On 11/29/18 6:18 PM, Tom Talpey wrote:
>>>> On 11/29/2018 8:39 PM, John Hubbard wrote:
>>>>> On 11/28/18 5:59 AM, Tom Talpey wrote:
>>>>>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>>>>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>>>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>>>>> [...]
>>>> Excerpting from below:
>>>>
>>>>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>>>>> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>>> cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>>>
>>>> vs
>>>>
>>>>> With patches applied:
>>>>> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>>> cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>>>
>>>> Perfect results, not CPU limited, and full IOPS.
>>>>
>>>> Curiously identical, so I trust you've checked that you measured
>>>> both targets, but if so, I say it's good.
>>>>
>>>
>>> Argh, copy-paste error in the email. The real "before" is ever so slightly
>>> better, at 194K IOPS and 759 MB/s:
>>
>> Definitely better - note the system CPU is lower, which is probably the
>> reason for the increased IOPS.
>>
>>> cpu : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73
>>
>> Good result - a correct implementation, and faster.
>>
>
> Thanks, Tom, I really appreciate your experience and help on what performance
> should look like here. (I'm sure you can guess that this is the first time
> I've worked with fio, heh.)

No problem, happy to chip in. Feel free to add my

Tested-by: Tom Talpey <ttalpey@microsoft.com>

I know, that's not the personal email I'm posting from, but it's me.

I'll hopefully be trying the code with the Linux SMB client (cifs.ko)
next week; Long Li is implementing direct I/O in that and we'll see how
it helps.

Mainly, I'm looking forward to seeing this enable RDMA-to-DAX.

Tom.

>
> I'll send out a new, non-RFC patchset soon, then.
>
> thanks,
>
From: John Hubbard <jhubbard@nvidia.com>

Hi, here is fodder for conversation during LPC.

Changes since v1:

a) Uses a simpler set/clear/test bit approach in the page flags.

b) Added some initial performance results in the cover letter here, below.

c) Rebased to latest linux.git

d) Puts pages back on the LRU when it's done with them.

Here is an updated proposal for tracking pages that have had
get_user_pages*() called on them. This is in support of fixing the problem
discussed in [1]. This RFC only shows how to set up a reliable
PageDmaPinned flag. What to *do* with that flag is left for a later
discussion.

The sequence would be:

    -- apply patches 1-2,

    -- convert the rest of the subsystems to call put_user_page*(). Patch #3
       is an example of that. It converts infiniband.

    -- apply patches 4-6,

    -- apply more patches, to actually use the new PageDmaPinned flag.

One question up front was, "how do we ensure that either put_user_page()
or put_page() is called, depending on whether the page came from
get_user_pages() or not?". From this series, you can see that:

    -- It's possible to assert within put_page(), that we are probably in the
       right place. In practice this assertion definitely helps.

    -- In the other direction, if put_user_page() is called when put_page()
       should have been used instead, then a "clear" report of LRU list
       corruption shows up reliably, because the first thing put_user_page()
       attempts is to decrement the lru's list.prev pointer, and so you'll
       see pointer values that are one less than an aligned pointer value.
       This is not great, but it's usable.

So I think that the conversion will turn into an exercise of trying to get
code coverage, and that should work out.

I have lots of other patches, not shown here, in various stages of polish, to
convert enough of the kernel in order to run fio [2]. I did that, and got some
rather noisy performance results.

Here they are again:

Performance notes:

1. These fio results are noisy. The std deviation is large enough that some of
this could be noise. In order to highlight that, I did 5 runs each with and
without the patch, and while there is definitely a performance drop on average,
it's also true that there is overlap in the results. In other words, the
best "with patch" run is about the same as the worst "without patch" run.

2. Initial profiling shows that we're adding about 1% total to this
particular test run...I think. I may have to narrow this down some more,
but I don't seem to see any real lock contention. Hints or ideas on measurement
methods are welcome, btw.

    -- 0.59% in put_user_page

    -- 0.2% (or 0.54%, depending on how you read the perf output below) in
       get_user_pages_fast:

          1.36%--iov_iter_get_pages
                    |
                     --1.27%--get_user_pages_fast
                               |
                                --0.54%--pin_page_for_dma

          0.59%--put_user_page

          1.34%     0.03%  fio   [kernel.vmlinux]     [k] _raw_spin_lock
          0.95%     0.55%  fio   [kernel.vmlinux]     [k] do_raw_spin_lock
          0.17%     0.03%  fio   [kernel.vmlinux]     [k] isolate_lru_page
          0.06%     0.00%  fio   [kernel.vmlinux]     [k] putback_lru_page

4.
Here's some raw fio data: one run without the patch, and two with the patch:

------------------------------------------------------
WITHOUT the patch:
------------------------------------------------------
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=55.5MiB/s,w=0KiB/s][r=14.2k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=1750: Tue Nov 6 20:18:06 2018
   read: IOPS=13.9k, BW=54.4MiB/s (57.0MB/s)(1024MiB/18826msec)
    slat (usec): min=25, max=4870, avg=68.19, stdev=85.21
    clat (usec): min=74, max=19814, avg=4525.40, stdev=1844.03
     lat (usec): min=183, max=19927, avg=4593.69, stdev=1866.65
    clat percentiles (usec):
     | 1.00th=[ 3687], 5.00th=[ 3720], 10.00th=[ 3720], 20.00th=[ 3752],
     | 30.00th=[ 3752], 40.00th=[ 3752], 50.00th=[ 3752], 60.00th=[ 3785],
     | 70.00th=[ 4178], 80.00th=[ 4490], 90.00th=[ 6652], 95.00th=[ 8225],
     | 99.00th=[13173], 99.50th=[14353], 99.90th=[16581], 99.95th=[17171],
     | 99.99th=[18220]
   bw ( KiB/s): min=49920, max=59320, per=100.00%, avg=55742.24, stdev=2224.20, samples=37
   iops : min=12480, max=14830, avg=13935.35, stdev=556.05, samples=37
  lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec) : 2=0.01%, 4=68.78%, 10=28.14%, 20=3.08%
  cpu : usr=2.39%, sys=95.30%, ctx=669, majf=0, minf=72
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=54.4MiB/s (57.0MB/s), 54.4MiB/s-54.4MiB/s (57.0MB/s-57.0MB/s), io=1024MiB (1074MB), run=18826-18826msec

Disk stats (read/write):
  nvme0n1: ios=259490/1, merge=0/0, ticks=14822/0, in_queue=19241, util=100.00%

------------------------------------------------------
With the patch applied:
------------------------------------------------------
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=51.2MiB/s,w=0KiB/s][r=13.1k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=2568: Tue Nov 6 20:03:50 2018
   read: IOPS=12.8k, BW=50.1MiB/s (52.5MB/s)(1024MiB/20453msec)
    slat (usec): min=33, max=4365, avg=74.05, stdev=85.79
    clat (usec): min=39, max=19818, avg=4916.61, stdev=1961.79
     lat (usec): min=100, max=20002, avg=4990.78, stdev=1985.23
    clat percentiles (usec):
     | 1.00th=[ 4047], 5.00th=[ 4080], 10.00th=[ 4080], 20.00th=[ 4080],
     | 30.00th=[ 4113], 40.00th=[ 4113], 50.00th=[ 4113], 60.00th=[ 4146],
     | 70.00th=[ 4178], 80.00th=[ 4817], 90.00th=[ 7308], 95.00th=[ 8717],
     | 99.00th=[14091], 99.50th=[15270], 99.90th=[17433], 99.95th=[18220],
     | 99.99th=[19006]
   bw ( KiB/s): min=45370, max=55784, per=100.00%, avg=51332.33, stdev=1843.77, samples=40
   iops : min=11342, max=13946, avg=12832.83, stdev=460.92, samples=40
  lat (usec) : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec) : 2=0.01%, 4=0.01%, 10=96.44%, 20=3.53%
  cpu : usr=2.91%, sys=95.18%, ctx=398, majf=0, minf=73
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=50.1MiB/s (52.5MB/s), 50.1MiB/s-50.1MiB/s (52.5MB/s-52.5MB/s), io=1024MiB (1074MB), run=20453-20453msec

Disk stats (read/write):
  nvme0n1: ios=261399/0, merge=0/0, ticks=16019/0, in_queue=20910, util=100.00%

------------------------------------------------------
OR, here's a better run WITH the patch applied, and you can see that this is nearly as good
as the "without" case:
------------------------------------------------------
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=53.2MiB/s,w=0KiB/s][r=13.6k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=2521: Tue Nov 6 20:01:33 2018
   read: IOPS=13.4k, BW=52.5MiB/s (55.1MB/s)(1024MiB/19499msec)
    slat (usec): min=30, max=12458, avg=69.71, stdev=88.01
    clat (usec): min=39, max=25590, avg=4687.42, stdev=1925.29
     lat (usec): min=97, max=25704, avg=4757.25, stdev=1946.06
    clat percentiles (usec):
     | 1.00th=[ 3884], 5.00th=[ 3884], 10.00th=[ 3916], 20.00th=[ 3916],
     | 30.00th=[ 3916], 40.00th=[ 3916], 50.00th=[ 3949], 60.00th=[ 3949],
     | 70.00th=[ 3982], 80.00th=[ 4555], 90.00th=[ 6915], 95.00th=[ 8848],
     | 99.00th=[13566], 99.50th=[14877], 99.90th=[16909], 99.95th=[17695],
     | 99.99th=[24249]
   bw ( KiB/s): min=48905, max=58016, per=100.00%, avg=53855.79, stdev=2115.03, samples=38
   iops : min=12226, max=14504, avg=13463.79, stdev=528.76, samples=38
  lat (usec) : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec) : 2=0.01%, 4=71.80%, 10=24.66%, 20=3.51%, 50=0.02%
  cpu : usr=3.47%, sys=94.61%, ctx=370, majf=0, minf=73
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=52.5MiB/s (55.1MB/s), 52.5MiB/s-52.5MiB/s (55.1MB/s-55.1MB/s), io=1024MiB (1074MB), run=19499-19499msec

Disk stats (read/write):
  nvme0n1: ios=260720/0, merge=0/0, ticks=15036/0, in_queue=19876, util=100.00%

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] fio: https://github.com/axboe/fio
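The headline figures in the runs above can be cross-checked from the raw counters in the same output. This is a minimal sketch under a simplifying assumption: IOPS is taken as issued I/Os divided by total runtime, and bandwidth as bytes transferred divided by total runtime (fio's own accounting may differ in detail). The function names are illustrative only.

```python
MIB = 1024 * 1024


def iops(total_ios, runtime_msec):
    """I/O operations per second, averaged over the whole run."""
    return total_ios / (runtime_msec / 1000.0)


def bandwidth_mib_per_sec(total_ios, block_size_bytes, runtime_msec):
    """Throughput in MiB/s, averaged over the whole run."""
    return total_ios * block_size_bytes / MIB / (runtime_msec / 1000.0)


# Counters from the "WITHOUT the patch" run:
# issued rwts total=262144, bs=4096B, run=18826msec
without_iops = iops(262144, 18826)
without_bw = bandwidth_mib_per_sec(262144, 4096, 18826)

# Counters from the first "With the patch applied" run: run=20453msec
with_iops = iops(262144, 20453)
with_bw = bandwidth_mib_per_sec(262144, 4096, 20453)

print("without: %.1fk IOPS, %.1f MiB/s" % (without_iops / 1000, without_bw))
print("with:    %.1fk IOPS, %.1f MiB/s" % (with_iops / 1000, with_bw))
```

Since both runs issue the same 262144 reads of 4096 bytes, the comparison reduces to the two runtimes, which makes it easy to see how much of a difference is a matter of a second or two of wall-clock time.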
John Hubbard (6):
  mm/gup: finish consolidating error handling
  mm: introduce put_user_page*(), placeholder versions
  infiniband/mm: convert put_page() to put_user_page*()
  mm: introduce page->dma_pinned_flags, _count
  mm: introduce zone_gup_lock, for dma-pinned pages
  mm: track gup pages with page->dma_pinned_* fields

 drivers/infiniband/core/umem.c              |   7 +-
 drivers/infiniband/core/umem_odp.c          |   2 +-
 drivers/infiniband/hw/hfi1/user_pages.c     |  11 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |   6 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  |  11 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |   6 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c    |   7 +-
 include/linux/mm.h                          |  13 ++
 include/linux/mm_types.h                    |  22 +++-
 include/linux/mmzone.h                      |   6 +
 include/linux/page-flags.h                  |  61 +++++++++
 mm/gup.c                                    |  58 +++++++-
 mm/memcontrol.c                             |   8 ++
 mm/page_alloc.c                             |   1 +
 mm/swap.c                                   | 138 ++++++++++++++++++++
 15 files changed, 320 insertions(+), 37 deletions(-)