| Message ID | 20181203102532.13945-1-nborisov@suse.com (mailing list archive) |
|---|---|
| State      | New, archived |
| Series     | [RFC] btrfs: Remove __extent_readpages |
On 3.12.18 at 12:25, Nikolay Borisov wrote:
> When extent_readpages is called from the generic readahead code it first
> builds a batch of 16 pages (which might or might not be consecutive,
> depending on whether add_to_page_cache_lru failed) and submits them to
> __extent_readpages. The latter ensures that the range of pages (in the
> batch of 16) that is passed to __do_contiguous_readpages is consecutive.
>
> [rest of the patch description and fio results snipped; see the original
> posting at the bottom of this thread]

I took slightly different measurements, namely: the call count of
contiguous_readpages/__do_contiguous_readpages, a latency histogram of calls
to extent_readpages, as well as the output from cachetop (from the bcc tools).
Also, I switched qemu's cache setting to 'none' for the virtio device which
fio is writing to.
The VM still has 4G of RAM and this time the fio cmdline is:

fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync --bs=4k
    --numjobs=1 --size=2G --runtime=600 --group_reporting --loop 10

(I've modified the file to be 2G rather than 1G as in the original patch.)

So here are the results:

UNPATCHED CASE:

/media/scratch/seqread: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=470MiB/s][r=120k IOPS][eta 00m:00s]
/media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=643: Mon Dec 3 13:55:09 2018
  read: IOPS=82.5k, BW=322MiB/s (338MB/s)(20.0GiB/63559msec)
    clat (nsec): min=1427, max=38694k, avg=5613.13, stdev=126477.18
     lat (nsec): min=1819, max=38695k, avg=6061.95, stdev=126479.46
    clat percentiles (nsec):
     |  1.00th=[   1480],  5.00th=[   1512], 10.00th=[   1528],
     | 20.00th=[   1560], 30.00th=[   1576], 40.00th=[   1592],
     | 50.00th=[   1624], 60.00th=[   1656], 70.00th=[   1720],
     | 80.00th=[   1896], 90.00th=[   2640], 95.00th=[   3024],
     | 99.00th=[   4768], 99.50th=[  12352], 99.90th=[1003520],
     | 99.95th=[3817472], 99.99th=[4685824]
   bw (  KiB/s): min=45728, max=549616, per=100.00%, avg=460140.08, stdev=144985.90, samples=91
   iops        : min=11432, max=137404, avg=115035.09, stdev=36246.50, samples=91
  lat (usec)   : 2=82.15%, 4=16.25%, 10=1.01%, 20=0.34%, 50=0.10%
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.02%
  lat (msec)   : 2=0.02%, 4=0.04%, 10=0.04%, 20=0.01%, 50=0.01%
  cpu          : usr=5.81%, sys=81.11%, ctx=5682, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%

FUNC                         COUNT
__do_contiguous_readpages    327935

Here is a histogram of latencies in microseconds:

       0 -> 1     : 0    |                                        |
       2 -> 3     : 0    |                                        |
       4 -> 7     : 0    |                                        |
       8 -> 15    : 0    |                                        |
      16 -> 31    : 0    |                                        |
      32 -> 63    : 10   |                                        |
      64 -> 127   : 20   |                                        |
     128 -> 255   : 6    |                                        |
     256 -> 511   : 13   |                                        |
     512 -> 1023  : 11   |                                        |
    1024 -> 2047  : 10   |                                        |
    2048 -> 4095  : 2516 |****************************************|
    4096 -> 8191  : 34   |                                        |

Output from cachetop for the duration of the test:

   TOTAL   MISSES     HITS  DIRTIES  BUFFERS_MB  CACHED_MB
     354        1      353        1          14       1063
  543360   524288    19072        0          14       2232
  379788   369092    10696        0          14       1956
  329056   316960    12096        0          14       1147
  377836   362181    15655        0          14        766
  543360   524288    19072        0          14       2036
  437191   423764    13427        0          14       2169
  327289   317440     9849        0          14       1361
  322175   307376    14799        0          14        513
  543352   524284    19068        0          14       1680
  481780   466852    14928        0          14       2338
  343440   331528    11912        0          14       1585
  344403   333824    10579        0          14        841
  564139   440492   123647        0          14       2160

total hits: 294800   total misses: 5242370

PATCHED CASE:

/media/scratch/seqread: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=529MiB/s][r=135k IOPS][eta 00m:00s]
/media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=620: Mon Dec 3 11:56:49 2018
  read: IOPS=81.2k, BW=317MiB/s (332MB/s)(10.0GiB/32302msec)
    clat (nsec): min=1428, max=152945k, avg=5927.07, stdev=246422.91
     lat (nsec): min=1837, max=152946k, avg=6427.46, stdev=246426.89
    clat percentiles (nsec):
     |  1.00th=[   1496],  5.00th=[   1528], 10.00th=[   1560],
     | 20.00th=[   1608], 30.00th=[   1672], 40.00th=[   1736],
     | 50.00th=[   1880], 60.00th=[   2192], 70.00th=[   2448],
     | 80.00th=[   2736], 90.00th=[   3280], 95.00th=[   4080],
     | 99.00th=[   7328], 99.50th=[  14528], 99.90th=[ 242688],
     | 99.95th=[3719168], 99.99th=[5013504]
   bw (  KiB/s): min= 1432, max=549808, per=100.00%, avg=396769.62, stdev=175814.08, samples=52
   iops        : min=  358, max=137452, avg=99192.48, stdev=43953.67, samples=52
  lat (usec)   : 2=53.85%, 4=40.84%, 10=4.54%, 20=0.47%, 50=0.15%
  lat (usec)   : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.02%, 10=0.04%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%
  cpu          : usr=6.53%, sys=83.13%, ctx=2746, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=317MiB/s (332MB/s), 317MiB/s-317MiB/s (332MB/s-332MB/s), io=10.0GiB (10.7GB), run=32302-32302msec

FUNC                     COUNT
contiguous_readpages     327935

So the total number of times we call the low-level submission function is
identical across the two versions. This implies that the patch doesn't result
in worse "merge" behavior, i.e. submitting more but smaller contiguous chunks
via contiguous_readpages.

Cachetop:

   TOTAL   MISSES     HITS  DIRTIES  BUFFERS_MB  CACHED_MB
  475016   459690    15326        0          15       2313
  316483   309138     7345        2          15       1472
  326726   315393    11333        1          15        656
  508038   487596    20442        0          15       1710
  470487   456532    13955        0          15       2296
  329494   318160    11334        0          15       1492
  312000   300217    11783        0          15        617
  518013   497629    20384        0          15       1552
  518780   503636    15144        0          15       2480
  338471   325632    12839        0          15       1705
  325678   315328    10350        0          15        888
  447023   428204    18819        0          15       1127
  647182   524289   122893        0          15       2160

total hits: 291947     <-- hits are negligibly smaller
total misses: 5241444  <-- so are misses
On Mon, Dec 03, 2018 at 12:25:32PM +0200, Nikolay Borisov wrote:
> When extent_readpages is called from the generic readahead code it first
> builds a batch of 16 pages (which might or might not be consecutive,
> depending on whether add_to_page_cache_lru failed) and submits them to
> __extent_readpages. The latter ensures that the range of pages (in the
> batch of 16) that is passed to __do_contiguous_readpages is consecutive.
>
> [rest of the patch description and fio results snipped]

What this patch changes is that now we aren't reading all of the pages we are
passed by the readahead, so we fall back to per-page reading when we go to
read those pages, because the readahead window has already moved past them.
This isn't great behavior: what we have now is nice in that it tries to group
the entire range together as much as possible. What your patch changes is
that as soon as we stop having a contiguous range we just bail and submit
what we have. Testing it in isolation is likely to show very little change,
but having recently touched the fault-in code, where we definitely do not
count major faults in some cases, I'd suspect that we're not reflecting this
higher fault rate in the performance counters properly. We should preserve
the existing behavior; what hurts a little bit on a lightly loaded box is
going to hurt a whole lot more on a heavily loaded box. Thanks,

Josef
On 5.12.18 at 18:58, Josef Bacik wrote:
> What this patch changes is now we aren't reading all of the pages we are
> passed by the readahead, so now we fall back to per-page reading when we go
> to read those pages because the readahead window has already moved past
> them. [...] We should preserve the existing behavior, what hurts a little
> bit on a lightly loaded box is going to hurt a whole lot more on a heavily
> loaded box.

Following our recent chat regarding this (and the previous prep patch), I
think neither constitutes a difference in behavior; they just make the code a
bit clearer. Also, Mel was kind enough to run more rigorous performance tests
and the results were identical. The main concern in your statements stems
from the fact that the prep patch introduced a for loop which creates the
batch, so the break you see here is for this nested loop and not for the main
loop that reads all pages passed by the RA code.
On 3.12.18 at 12:25, Nikolay Borisov wrote:
> When extent_readpages is called from the generic readahead code it first
> builds a batch of 16 pages (which might or might not be consecutive,
> depending on whether add_to_page_cache_lru failed) and submits them to
> __extent_readpages. The latter ensures that the range of pages (in the
> batch of 16) that is passed to __do_contiguous_readpages is consecutive.
>
> [rest of the patch description, fio results, and diff snipped]
>
>  fs/btrfs/extent_io.c | 63 +++++++++++++-------------------------
>  1 file changed, 19 insertions(+), 44 deletions(-)

Revisiting the patch and the results, IMHO it shouldn't be RFC and is ready
for merging. So gentle ping :)
On Mon, Feb 11, 2019 at 09:46:10AM +0200, Nikolay Borisov wrote:
> > Clearly both bandwidth and iops are worse. However I'm puzzled as to why
> > this is the case, given that I don't see how this patch affects the
> > submission of readahead io.
> >
> >  fs/btrfs/extent_io.c | 63 +++++++++++++-------------------------
> >  1 file changed, 19 insertions(+), 44 deletions(-)
>
> Revisiting the patch and the results IMHO it shouldn't be RFC and is
> ready for merging.

I almost missed it. Please resend it without RFC, update the changelog with
anything relevant that came out of the discussion, and fix the typos and
coding style. The subject should describe what the core of the change is;
removing __extent_readpages is IMHO just a side effect. Thanks.
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 233f835dd6f8..c097eec1b73d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3059,7 +3059,7 @@ static int __do_readpage(struct extent_io_tree *tree,
 	return ret;
 }
 
-static inline void __do_contiguous_readpages(struct extent_io_tree *tree,
+static inline void contiguous_readpages(struct extent_io_tree *tree,
 					     struct page *pages[], int nr_pages,
 					     u64 start, u64 end,
 					     struct extent_map **em_cached,
@@ -3090,46 +3090,6 @@ static inline void __do_contiguous_readpages(struct extent_io_tree *tree,
 	}
 }
 
-static void __extent_readpages(struct extent_io_tree *tree,
-			       struct page *pages[],
-			       int nr_pages,
-			       struct extent_map **em_cached,
-			       struct bio **bio, unsigned long *bio_flags,
-			       u64 *prev_em_start)
-{
-	u64 start = 0;
-	u64 end = 0;
-	u64 page_start;
-	int index;
-	int first_index = 0;
-
-	for (index = 0; index < nr_pages; index++) {
-		page_start = page_offset(pages[index]);
-		if (!end) {
-			start = page_start;
-			end = start + PAGE_SIZE - 1;
-			first_index = index;
-		} else if (end + 1 == page_start) {
-			end += PAGE_SIZE;
-		} else {
-			__do_contiguous_readpages(tree, &pages[first_index],
-						  index - first_index, start,
-						  end, em_cached,
-						  bio, bio_flags,
-						  prev_em_start);
-			start = page_start;
-			end = start + PAGE_SIZE - 1;
-			first_index = index;
-		}
-	}
-
-	if (end)
-		__do_contiguous_readpages(tree, &pages[first_index],
-					  index - first_index, start,
-					  end, em_cached, bio,
-					  bio_flags, prev_em_start);
-}
-
 static int __extent_read_full_page(struct extent_io_tree *tree,
 				   struct page *page,
 				   get_extent_t *get_extent,
@@ -4104,6 +4064,13 @@ int extent_readpages(struct address_space *mapping, struct list_head *pages,
 	u64 prev_em_start = (u64)-1;
 
 	while (!list_empty(pages)) {
+		u64 contig_end = 0;
+
+		/*
+		 * Produces a batch of up to 16 contiguous pages, assumes
+		 * that pages are consecutive in pages list (guaranteed by the
+		 * generic code)
+		 **/
 		for (nr = 0; nr < BTRFS_PAGES_BATCH && !list_empty(pages);) {
 			struct page *page = lru_to_page(pages);
 
@@ -4112,14 +4079,22 @@ int extent_readpages(struct address_space *mapping, struct list_head *pages,
 		if (add_to_page_cache_lru(page, mapping, page->index,
 				readahead_gfp_mask(mapping))) {
 			put_page(page);
-			continue;
+			break;
 		}
 
 		pagepool[nr++] = page;
+		contig_end = page_offset(page) + PAGE_SIZE - 1;
 	}
 
-	__extent_readpages(tree, pagepool, nr, &em_cached, &bio,
-			   &bio_flags, &prev_em_start);
+	if (nr) {
+		u64 contig_start = page_offset(pagepool[0]);
+
+		ASSERT(contig_start + (nr*PAGE_SIZE) - 1 == contig_end);
+
+		contiguous_readpages(tree, pagepool, nr, contig_start,
+				contig_end, &em_cached, &bio, &bio_flags,
+				&prev_em_start);
+	}
 	}
 
 	if (em_cached)
When extent_readpages is called from the generic readahead code it first
builds a batch of 16 pages (which might or might not be consecutive,
depending on whether add_to_page_cache_lru failed) and submits them to
__extent_readpages. The latter ensures that the range of pages (in the
batch of 16) that is passed to __do_contiguous_readpages is consecutive.

If add_to_page_cache_lru doesn't fail then __extent_readpages will call
__do_contiguous_readpages only once with the whole batch of 16.
Alternatively, if add_to_page_cache_lru fails once on the 8th page (as an
example) then the contiguous page read code will be called twice.

All of this can be simplified by exploiting the fact that all pages passed
to extent_readpages are consecutive, thus when the batch is built in that
function it is already consecutive (barring add_to_page_cache_lru failures)
and is ready to be submitted directly to __do_contiguous_readpages. Also
simplify the name of the function to contiguous_readpages.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---

So this patch looks like a very nice cleanup; however, when doing performance
measurements with fio I was shocked to see that it is actually detrimental to
performance. Here are the results:

The command line used for fio:

fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync --bs=4k
    --numjobs=1 --size=1G --runtime=600 --group_reporting --loop 10

This was tested on a VM with 4G of RAM, so the size of the test is smaller
than the memory and pages should have been nicely read ahead.

PATCHED:

Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=519MiB/s][r=133k IOPS][eta 00m:00s]
/media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=3722: Mon Dec 3 09:57:17 2018
  read: IOPS=78.4k, BW=306MiB/s (321MB/s)(10.0GiB/33444msec)
    clat (nsec): min=1703, max=9042.7k, avg=5463.97, stdev=121068.28
     lat (usec): min=2, max=9043, avg= 6.00, stdev=121.07
    clat percentiles (nsec):
     |  1.00th=[   1848],  5.00th=[   1896], 10.00th=[   1912],
     | 20.00th=[   1960], 30.00th=[   2024], 40.00th=[   2160],
     | 50.00th=[   2384], 60.00th=[   2576], 70.00th=[   2800],
     | 80.00th=[   3120], 90.00th=[   3824], 95.00th=[   4768],
     | 99.00th=[   7968], 99.50th=[  14912], 99.90th=[  50944],
     | 99.95th=[ 667648], 99.99th=[5931008]
   bw (  KiB/s): min= 2768, max=544542, per=100.00%, avg=409912.68, stdev=162333.72, samples=50
   iops        : min=  692, max=136135, avg=102478.08, stdev=40583.47, samples=50
  lat (usec)   : 2=25.93%, 4=65.58%, 10=7.69%, 20=0.57%, 50=0.13%
  lat (usec)   : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.05%
  cpu          : usr=7.20%, sys=92.55%, ctx=396, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=306MiB/s (321MB/s), 306MiB/s-306MiB/s (321MB/s-321MB/s), io=10.0GiB (10.7GB), run=33444-33444msec

UNPATCHED:

Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=568MiB/s][r=145k IOPS][eta 00m:00s]
/media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=640: Mon Dec 3 10:07:38 2018
  read: IOPS=90.4k, BW=353MiB/s (370MB/s)(10.0GiB/29008msec)
    clat (nsec): min=1418, max=12374k, avg=4816.38, stdev=109448.00
     lat (nsec): min=1836, max=12374k, avg=5284.46, stdev=109451.36
    clat percentiles (nsec):
     |  1.00th=[   1576],  5.00th=[   1608], 10.00th=[   1640],
     | 20.00th=[   1672], 30.00th=[   1720], 40.00th=[   1832],
     | 50.00th=[   2096], 60.00th=[   2288], 70.00th=[   2480],
     | 80.00th=[   2736], 90.00th=[   3248], 95.00th=[   3952],
     | 99.00th=[   6368], 99.50th=[  12736], 99.90th=[  43776],
     | 99.95th=[ 798720], 99.99th=[5341184]
   bw (  KiB/s): min=34144, max=606208, per=100.00%, avg=465737.56, stdev=177637.57, samples=45
   iops        : min= 8536, max=151552, avg=116434.33, stdev=44409.46, samples=45
  lat (usec)   : 2=45.74%, 4=49.50%, 10=4.13%, 20=0.45%, 50=0.08%
  lat (usec)   : 100=0.03%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.05%, 20=0.01%
  cpu          : usr=7.14%, sys=92.39%, ctx=1059, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=353MiB/s (370MB/s), 353MiB/s-353MiB/s (370MB/s-370MB/s), io=10.0GiB (10.7GB), run=29008-29008msec

Clearly both bandwidth and iops are worse. However, I'm puzzled as to why
this is the case, given that I don't see how this patch affects the
submission of readahead io.

 fs/btrfs/extent_io.c | 63 +++++++++++++-------------------------
 1 file changed, 19 insertions(+), 44 deletions(-)