Message ID | 20201105194416.GA1384085@redhat.com (mailing list archive)
---|---
State | New, archived
Series | virtiofsd: Use --thread-pool-size=0 to mean no thread pool
On Thu, Nov 05, 2020 at 02:44:16PM -0500, Vivek Goyal wrote:
> Right now we create a thread pool and the main thread hands over each
> request to a thread in the thread pool to process. The number of threads
> in the thread pool can be managed by the option --thread-pool-size.
>
> There is a chance that for some workloads we might get better performance
> if we don't hand over the request to a different thread and instead
> process it in the context of the thread receiving the request.
>
> To implement that, redefine the meaning of --thread-pool-size=0 to mean
> don't use a thread pool. Instead, process the request in the context of
> the thread receiving it from the queue.
>
> I can't think how --thread-pool-size=0 is useful currently, hence I am
> reusing it. If it is already useful somehow, I could look at defining a
> new option, say "--no-thread-pool".
>
> I think this patch will be used more as a debug help, to compare when it
> is more efficient to not hand over requests to a thread pool.

I ran virtiofs-tests to compare --thread-pool-size=0 and
--thread-pool-size=64, and the results seem to be all over the place. In
some cases the thread pool seems to perform better and in other cases
no-thread-pool seems to perform better.

But in general it looks like, except for the libaio workloads,
no-thread-pool is helping.

Thanks
Vivek

NAME            WORKLOAD                Bandwidth       IOPS
thread-pool     seqread-psync           682.4mb         170.6k
no-thread-pool  seqread-psync           679.3mb         169.8k

thread-pool     seqread-psync-multi     2415.9mb        603.9k
no-thread-pool  seqread-psync-multi     2528.5mb        632.1k

thread-pool     seqread-mmap            591.7mb         147.9k
no-thread-pool  seqread-mmap            595.6mb         148.9k

thread-pool     seqread-mmap-multi      2195.3mb        548.8k
no-thread-pool  seqread-mmap-multi      2286.1mb        571.5k

thread-pool     seqread-libaio          329.1mb         82.2k
no-thread-pool  seqread-libaio          271.5mb         67.8k

thread-pool     seqread-libaio-multi    1387.1mb        346.7k
no-thread-pool  seqread-libaio-multi    1508.2mb        377.0k

thread-pool     randread-psync          59.0mb          14.7k
no-thread-pool  randread-psync          78.5mb          19.6k

thread-pool     randread-psync-multi    226.4mb         56.6k
no-thread-pool  randread-psync-multi    289.2mb         72.3k

thread-pool     randread-mmap           48.4mb          12.1k
no-thread-pool  randread-mmap           57.4mb          14.3k

thread-pool     randread-mmap-multi     183.5mb         45.8k
no-thread-pool  randread-mmap-multi     240.5mb         60.1k

thread-pool     randread-libaio         329.4mb         82.3k
no-thread-pool  randread-libaio         261.8mb         65.4k

thread-pool     randread-libaio-multi   252.1mb         63.0k
no-thread-pool  randread-libaio-multi   278.3mb         69.5k

thread-pool     seqwrite-psync          54.9mb          13.7k
no-thread-pool  seqwrite-psync          77.8mb          19.4k

thread-pool     seqwrite-psync-multi    219.9mb         54.9k
no-thread-pool  seqwrite-psync-multi    314.8mb         78.7k

thread-pool     seqwrite-mmap           191.7mb         47.9k
no-thread-pool  seqwrite-mmap           201.4mb         50.3k

thread-pool     seqwrite-mmap-multi     346.6mb         86.6k
no-thread-pool  seqwrite-mmap-multi     387.6mb         96.9k

thread-pool     seqwrite-libaio         306.4mb         76.6k
no-thread-pool  seqwrite-libaio         248.2mb         62.0k

thread-pool     seqwrite-libaio-multi   328.5mb         82.1k
no-thread-pool  seqwrite-libaio-multi   305.6mb         76.4k

thread-pool     randwrite-psync         55.6mb          13.9k
no-thread-pool  randwrite-psync         81.2mb          20.3k

thread-pool     randwrite-psync-multi   227.0mb         56.7k
no-thread-pool  randwrite-psync-multi   311.6mb         77.8k

thread-pool     randwrite-mmap          35.3mb          9038
no-thread-pool  randwrite-mmap          58.1mb          14.5k

thread-pool     randwrite-mmap-multi    140.8mb         35.2k
no-thread-pool  randwrite-mmap-multi    210.5mb         52.6k

thread-pool     randwrite-libaio        307.1mb         76.7k
no-thread-pool  randwrite-libaio        279.4mb         69.8k

thread-pool     randwrite-libaio-multi  361.3mb         90.3k
no-thread-pool  randwrite-libaio-multi  302.6mb         75.6k

thread-pool     randrw-psync            34.1mb/11.4mb   8754/2929
no-thread-pool  randrw-psync            38.7mb/12.9mb   9919/3321

thread-pool     randrw-psync-multi      126.2mb/42.3mb  31.5k/10.5k
no-thread-pool  randrw-psync-multi      172.2mb/57.7mb  43.0k/14.4k

thread-pool     randrw-mmap             24.1mb/8270kb   6194/2067
no-thread-pool  randrw-mmap             42.0mb/14.0mb   10.5k/3606

thread-pool     randrw-mmap-multi       127.9mb/42.8mb  31.9k/10.7k
no-thread-pool  randrw-mmap-multi       179.4mb/60.0mb  44.8k/15.0k

thread-pool     randrw-libaio           64.0mb/21.4mb   16.0k/5485
no-thread-pool  randrw-libaio           79.6mb/26.6mb   19.9k/6831

thread-pool     randrw-libaio-multi     174.2mb/58.3mb  43.5k/14.5k
no-thread-pool  randrw-libaio-multi     201.6mb/67.5mb  50.4k/16.8k
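For readers unfamiliar with the mechanism under test: the patch switches between the two dispatch models sketched below. This is a minimal, self-contained GLib illustration under stated assumptions, not virtiofsd code; process_request() is a hypothetical stand-in for virtiofsd's fv_queue_worker().

```
/*
 * Sketch of the two dispatch models compared above: hand each request
 * to a GLib thread pool, or process it inline in the receiving thread.
 * Build with: gcc sketch.c $(pkg-config --cflags --libs glib-2.0)
 */
#include <glib.h>
#include <stdio.h>

/* Stand-in for virtiofsd's fv_queue_worker(). */
static void process_request(gpointer data, gpointer user_data)
{
    printf("processing request %d\n", GPOINTER_TO_INT(data));
}

int main(void)
{
    int thread_pool_size = 0;   /* 0 == no pool, as the patch defines it */
    GThreadPool *pool = NULL;

    if (thread_pool_size) {
        pool = g_thread_pool_new(process_request, NULL, thread_pool_size,
                                 FALSE, NULL);
    }

    for (int i = 1; i <= 4; i++) {
        if (pool) {
            g_thread_pool_push(pool, GINT_TO_POINTER(i), NULL); /* hand off */
        } else {
            process_request(GINT_TO_POINTER(i), NULL);          /* inline */
        }
    }

    if (pool) {
        g_thread_pool_free(pool, FALSE, TRUE);  /* wait for queued work */
    }
    return 0;
}
```

The inline path avoids a thread handoff (wakeup plus context switch) per request, which is the latency the benchmark numbers above are probing.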
Hi Vivek,

I have tested with Kata 1.12-alpha0; the results seem to be better for
the fio config I am tracking.

The fio config does randrw:

fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=200M --readwrite=randrw --rwmixread=75

- I can see better results with this patch
- 9pfs is still better in the case of Kata because of the use of:
  https://github.com/kata-containers/packaging/blob/stable-1.12/qemu/patches/5.0.x/0001-9p-removing-coroutines-of-9p-to-increase-the-I-O-per.patch

Results:

./fio-results-run_virtiofs_tread_pool_0
   READ: bw=42.8MiB/s (44.8MB/s), 42.8MiB/s-42.8MiB/s (44.8MB/s-44.8MB/s), io=150MiB (157MB), run=3507-3507msec
  WRITE: bw=14.3MiB/s (14.9MB/s), 14.3MiB/s-14.3MiB/s (14.9MB/s-14.9MB/s), io=49.0MiB (52.4MB), run=3507-3507msec
./fio-results-run_9pfs
   READ: bw=55.1MiB/s (57.8MB/s), 55.1MiB/s-55.1MiB/s (57.8MB/s-57.8MB/s), io=150MiB (157MB), run=2722-2722msec
  WRITE: bw=18.4MiB/s (19.3MB/s), 18.4MiB/s-18.4MiB/s (19.3MB/s-19.3MB/s), io=49.0MiB (52.4MB), run=2722-2722msec
./fio-results-run_virtiofs_tread_pool_1
   READ: bw=34.5MiB/s (36.1MB/s), 34.5MiB/s-34.5MiB/s (36.1MB/s-36.1MB/s), io=150MiB (157MB), run=4354-4354msec
  WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=49.0MiB (52.4MB), run=4354-4354msec
./fio-results-run_virtiofs_tread_pool_64
   READ: bw=32.3MiB/s (33.8MB/s), 32.3MiB/s-32.3MiB/s (33.8MB/s-33.8MB/s), io=150MiB (157MB), run=4648-4648msec
  WRITE: bw=10.8MiB/s (11.3MB/s), 10.8MiB/s-10.8MiB/s (11.3MB/s-11.3MB/s), io=49.0MiB (52.4MB), run=4648-4648msec

Next:
- run https://github.com/rhvgoyal/virtiofs-tests for tread_pool_0, tread_pool_64, and 9pfs
- Test https://lore.kernel.org/linux-fsdevel/20201009181512.65496-1-vgoyal@redhat.com/

All the testing for Kata is based on:
https://github.com/jcvenegas/mrunner/blob/master/scripts/bash_workloads/build-qemu-and-run-fio-test.sh

I ran this using an Azure VM.

Cheers,
Carlos

On 05/11/20 13:53, "Vivek Goyal" <vgoyal@redhat.com> wrote:

[..]
On Fri, Nov 06, 2020 at 08:33:50PM +0000, Venegas Munoz, Jose Carlos wrote:
> Hi Vivek,
>
> I have tested with Kata 1.12-alpha0; the results seem to be better for
> the fio config I am tracking.
>
> The fio config does randrw:
>
> fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=200M --readwrite=randrw --rwmixread=75

Hi Carlos,

Thanks for the testing.

So basically two conclusions from your tests.

- For virtiofs, --thread-pool-size=0 is performing better as compared
  to --thread-pool-size=1 as well as --thread-pool-size=64.
  Approximately 35-40% better.

- virtio-9p is still approximately 30% better than virtiofs
  --thread-pool-size=0.

As I had analyzed earlier, this particular workload (mixed read and
write) is bad with virtiofs because after every write we invalidate
attrs and cache, so the next read ends up fetching attrs again. I had
posted patches to gain back some of the performance.

https://lore.kernel.org/linux-fsdevel/20200929185015.GG220516@redhat.com/

But I got the feedback to look into implementing file leases instead.

Anyway, good to know that --thread-pool-size=0 narrows the performance
gap substantially (more than half) for this workload. I will look into
identifying further bottlenecks.

Given that in my tests most of the workloads benefited from
--thread-pool-size=0, maybe this can be the default, and the user can
opt in to a thread pool (--thread-pool-size=64) once it is known that a
particular workload benefits from it.

David, Stefan, WDYT?

Thanks
Vivek

[..]
* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Fri, Nov 06, 2020 at 08:33:50PM +0000, Venegas Munoz, Jose Carlos wrote:
> > Hi Vivek,
> >
> > I have tested with Kata 1.12-alpha0; the results seem to be better
> > for the fio config I am tracking.
[..]
> Anyway, good to know that --thread-pool-size=0 narrows the performance
> gap substantially (more than half) for this workload. I will look into
> identifying further bottlenecks.

Could you try this thread-pool-size=0 together with your attr patches;
does that catch up to 9p (even though I know you need to rework the attr
stuff as leases)?

> Given that in my tests most of the workloads benefited from
> --thread-pool-size=0, maybe this can be the default, and the user can
> opt in to a thread pool (--thread-pool-size=64) once it is known that a
> particular workload benefits from it.
>
> David, Stefan, WDYT?

I think having the option makes sense; the default I'm not sure about.
What confuses me is why it helps - although it's also interesting that
it helps 9p as well.

Dave

[..]
* Vivek Goyal (vgoyal@redhat.com) wrote:
> Right now we create a thread pool and the main thread hands over each
> request to a thread in the thread pool to process. The number of threads
> in the thread pool can be managed by the option --thread-pool-size.
[..]
> I think this patch will be used more as a debug help, to compare when it
> is more efficient to not hand over requests to a thread pool.
>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

I think this is OK, but you need to fix the style to match qemu rather
than kernel style.
(See qemu's scripts/checkpatch.pl)

Dave

> ---
>  tools/virtiofsd/fuse_virtio.c | 33 ++++++++++++++++++++++++---------
>  1 file changed, 24 insertions(+), 9 deletions(-)
[..]
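Concretely, the kernel-style constructs checkpatch flags in this patch are the brace-less single-statement ifs and the "//" comment. In QEMU style the dispatch hunk would look roughly like the fragment below (a sketch of the expected adjustment, not the actual V2 patch):

```
/* Sketch of the dispatch hunk restyled for QEMU: braces on every if,
 * block comments instead of "//". Names are from the quoted patch. */
if (!se->thread_pool_size) {
    req_list = g_list_prepend(req_list, req);
} else {
    g_thread_pool_push(pool, req, NULL);
}

/* ... later, after dropping the queue locks ... */

/* Process all the requests. */
if (!se->thread_pool_size && req_list != NULL) {
    g_list_foreach(req_list, fv_queue_worker, qi);
    g_list_free(req_list);
    req_list = NULL;
}
```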
On Mon, Nov 09, 2020 at 10:02:18AM +0000, Dr. David Alan Gilbert wrote:

[..]

> > Given that in my tests most of the workloads benefited from
> > --thread-pool-size=0, maybe this can be the default, and the user can
> > opt in to a thread pool (--thread-pool-size=64) once it is known that
> > a particular workload benefits from it.
> >
> > David, Stefan, WDYT?
>
> I think having the option makes sense; the default I'm not sure about.

Ok, for now we can probably just introduce this option. And if further
test results prove it to be faster for most people, then we can consider
switching the default.

Vivek
On Mon, Nov 09, 2020 at 10:11:02AM +0000, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgoyal@redhat.com) wrote:
[..]
> I think this is OK, but you need to fix the style to match qemu rather
> than kernel style.
> (See qemu's scripts/checkpatch.pl)

Sure. I just posted V2 of the patch and took care of all the issues
reported by qemu's scripts/checkpatch.pl.

Vivek
On Fri, Nov 6, 2020 at 11:35 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Fri, Nov 06, 2020 at 08:33:50PM +0000, Venegas Munoz, Jose Carlos wrote:
[..]
> As I had analyzed earlier, this particular workload (mixed read and
> write) is bad with virtiofs because after every write we invalidate
> attrs and cache, so the next read ends up fetching attrs again. I had
> posted patches to gain back some of the performance.
>
> https://lore.kernel.org/linux-fsdevel/20200929185015.GG220516@redhat.com/
>
> But I got the feedback to look into implementing file leases instead.

Hmm, the FUSE_AUTO_INVAL_DATA feature is buggy, how about turning it
off for now? 9p doesn't have it, so there is no point in enabling it
for virtiofs by default.

Also I think some confusion comes from cache=auto being the default
for virtiofs. Not sure what the default is for 9p, but comparing
default to default will definitely not be apples to apples, since this
mode is nonexistent in 9p.

9p:cache=none  <-> virtiofs:cache=none
9p:cache=loose <-> virtiofs:cache=always

"9p:cache=mmap" and "virtiofs:cache=auto" have no match.

Untested patch attached.

Thanks,
Miklos
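For reference, in a plain libfuse (3.x) daemon the auto-invalidation feature is negotiated in the init handler, so opting out is a one-flag change. The fragment below is illustrative only; virtiofsd carries its own copy of the libfuse code and its own option plumbing, so this is not how the virtiofsd change would land:

```
/* Fragment: a libfuse low-level init callback that declines the
 * auto-inval-data capability even when the kernel offers it. */
#include <fuse_lowlevel.h>

static void my_init(void *userdata, struct fuse_conn_info *conn)
{
    /* Don't ask the kernel for automatic data invalidation. */
    if (conn->capable & FUSE_CAP_AUTO_INVAL_DATA) {
        conn->want &= ~FUSE_CAP_AUTO_INVAL_DATA;
    }
}
```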
On Thu, Nov 12, 2020 at 10:06 AM Miklos Szeredi <mszeredi@redhat.com> wrote:
>
[..]

Another performance issue with virtiofs could be due to the strict
page writeback rules in fuse that are meant to prevent misuse of
kernel memory by unprivileged processes. Since virtiofs isn't
subject to that limitation, these could be relaxed.

Attaching a patch that does one half of this. The other half is
getting rid of the page copying; that's a bit more involved, but
shouldn't be too difficult. We just need to duplicate the cached
writeback callbacks for virtiofs and do away with the complex temp
page stuff.

Thanks,
Miklos
On Thursday, 12 November 2020 10:06:37 CET Miklos Szeredi wrote:
> On Fri, Nov 6, 2020 at 11:35 PM Vivek Goyal <vgoyal@redhat.com> wrote:
[..]
> Also I think some confusion comes from cache=auto being the default
> for virtiofs. Not sure what the default is for 9p, but comparing
> default to default will definitely not be apples to apples, since this
> mode is nonexistent in 9p.

The default for 9p is cache=none. That should be changed to cache=mmap
being the default IMO, because if users stick with the default setting
'none', it breaks software relying on mmap() calls.

> 9p:cache=none  <-> virtiofs:cache=none
> 9p:cache=loose <-> virtiofs:cache=always
>
> "9p:cache=mmap" and "virtiofs:cache=auto" have no match.

What does 'auto' do exactly?

> Untested patch attached.
>
> Thanks,
> Miklos

Best regards,
Christian Schoenebeck
On Thu, Nov 12, 2020 at 12:34 PM Christian Schoenebeck
<qemu_oss@crudebyte.com> wrote:
>
> On Thursday, 12 November 2020 10:06:37 CET Miklos Szeredi wrote:
> >
> > 9p:cache=none  <-> virtiofs:cache=none
> > 9p:cache=loose <-> virtiofs:cache=always
> >
> > "9p:cache=mmap" and "virtiofs:cache=auto" have no match.
>
> What does 'auto' do exactly?

It has a configurable timeout (the default is 1s) for the attribute and
lookup cache, and close-to-open consistency for data (flush on close,
invalidate on open). This is similar to, but less refined than, NFS.

It's something that could easily be improved if there's a need for it.

Thanks,
Miklos
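A toy sketch of the two cache=auto rules Miklos describes above (an assumed simplification for illustration, not the actual fuse kernel code): an attribute cache that is valid only inside the timeout window, and close-to-open data consistency keyed on the last-seen mtime.

```
#include <stdbool.h>
#include <time.h>

#define ATTR_TIMEOUT_NS 1000000000LL  /* 1s default timeout */

struct cached_inode {
    struct timespec attr_cached_at;  /* when attrs were last fetched */
    struct timespec known_mtime;     /* server mtime seen at last open */
};

/* Attribute/lookup cache is trusted only inside the timeout window. */
static bool attrs_valid(const struct cached_inode *ci,
                        const struct timespec *now)
{
    long long delta =
        (now->tv_sec - ci->attr_cached_at.tv_sec) * 1000000000LL +
        (now->tv_nsec - ci->attr_cached_at.tv_nsec);
    return delta < ATTR_TIMEOUT_NS;
}

/* Close-to-open: on open, drop cached data if the file changed. */
static bool invalidate_data_on_open(const struct cached_inode *ci,
                                    const struct timespec *server_mtime)
{
    return ci->known_mtime.tv_sec != server_mtime->tv_sec ||
           ci->known_mtime.tv_nsec != server_mtime->tv_nsec;
}
```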
> Not sure what the default is for 9p, but comparing
> default to default will definitely not be apples to apples since this
> mode is nonexistent in 9p.

In Kata we are looking for the best config for fs compatibility and
performance. So even if it is not apples to apples, we go for the best
config for both and compare the best that each of them can do.

In the case of Kata for 9pfs (this is the config we have found to have
better performance and fs compatibility in general) we have:

```
-device virtio-9p-pci        # device type
,disable-modern=false
,fsdev=extra-9p-kataShared   # attr: device id for fsdev
,mount_tag=kataShared        # attr: tag by which the shared fs will be found
,romfile=
-fsdev local                 # local: Simply lets QEMU call the individual VFS functions (more or less) directly on the host.
,id=extra-9p-kataShared
,path=${SHARED_PATH}         # attr: path to share
,security_model=none
# passthrough: Files are stored using the same credentials as they are created with on the guest. This requires QEMU to run as root.
# none: Same as "passthrough" except the server won't report failures if it fails to set file attributes like ownership (chown).
# This makes a passthrough-like security model usable for people who run kvm as non-root.
,multidevs=remap
```

The mount options are:

```
trans=virtio
,version=9p2000.L
,cache=mmap
,"nodev"       # Security: The nodev mount option specifies that the filesystem cannot contain special devices.
,"msize=8192"  # msize: Maximum packet size including any headers.
```

Additionally we use this patch:
https://github.com/kata-containers/packaging/blob/stable-1.12/qemu/patches/5.0.x/0001-9p-removing-coroutines-of-9p-to-increase-the-I-O-per.patch

In Kata for virtiofs I am testing with:

```
-chardev socket
,id=ID-SOCKET
,path=.../vhost-fs.sock      # Path to vhost socket
-device vhost-user-fs-pci
,chardev=ID-SOCKET
,tag=kataShared
,romfile=

-object memory-backend-file  # force use of memory sharable with virtiofsd.
,id=dimm1
,size=2048M
,mem-path=/dev/shm
,share=on
```

Virtiofsd:

```
-o cache=auto
-o no_posix_lock             # enable/disable remote posix lock
--thread-pool-size=0
```

And virtiofs mount options:

```
source:"kataShared"
fstype:"virtiofs"
```

With this patch, comparing these two configurations, I have seen better
performance with virtiofs on different hosts.

-
Carlos

On 12/11/20 3:06, "Miklos Szeredi" <mszeredi@redhat.com> wrote:

[..]
On Tuesday, 17 November 2020 17:00:09 CET Venegas Munoz, Jose Carlos wrote:
> In the case of Kata for 9pfs (this is the config we have found to have
> better performance and fs compatibility in general) we have:
[..]
> The mount options are:
> ```
> trans=virtio
> ,version=9p2000.L
> ,cache=mmap
> ,"nodev"       # Security: The nodev mount option specifies that the filesystem cannot contain special devices.
> ,"msize=8192"  # msize: Maximum packet size including any headers.
> ```

Too small. You should definitely set 'msize' larger, as a small msize
value causes 9p requests to be broken down into more and smaller 9p
requests:

https://wiki.qemu.org/Documentation/9psetup#msize

Setting msize to a fairly high value might also substantially increase
I/O performance, i.e. on large I/O, if the guest application honors the
st_blksize returned by calling *stat()
[ https://man7.org/linux/man-pages/man2/stat.2.html ]:

    st_blksize
        This field gives the "preferred" block size for efficient
        filesystem I/O.

For instance 'time cat /large/file' would be substantially faster, as it
automatically adapts the chunk size.

> Additionally we use this patch:
> https://github.com/kata-containers/packaging/blob/stable-1.12/qemu/patches/5.0.x/0001-9p-removing-coroutines-of-9p-to-increase-the-I-O-per.patch

Interesting radical approach. :)

Yeah, the coroutine code in 9pfs is still a huge bottleneck. The problem
with the existing coroutine code is that it dispatches too often between
threads (i.e. the main I/O thread of the 9p server and the background
I/O threads doing the fs driver work), which causes huge latencies.

I started to fix that, yet I have only completed the Treaddir request
handler so far, which however gave a huge performance boost on such
Treaddir requests:

https://lists.gnu.org/archive/html/qemu-devel/2020-01/msg05539.html

The plan is to do the same for the other 9p request handlers, and
finally to introduce an 'order=relaxed' option (right now the order is
strict only). That should end up being faster than your current
no-coroutine hack later on, because you would have the benefits of both
worlds: low latency and parallelism for the fs driver I/O work.

[..]

Best regards,
Christian Schoenebeck
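As a concrete illustration of the st_blksize point Christian makes above, here is a small self-contained POSIX program (nothing 9p-specific) that reads a file in the filesystem's preferred chunk size:

```
/* Read a file in st_blksize-sized chunks, the filesystem's preferred
 * I/O unit, as stat(2) documents. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    struct stat st;
    if (fstat(fd, &st) < 0) {
        perror("fstat");
        return 1;
    }
    char *buf = malloc(st.st_blksize);
    if (!buf) {
        perror("malloc");
        return 1;
    }
    ssize_t n, total = 0;
    while ((n = read(fd, buf, st.st_blksize)) > 0) {
        total += n;
    }
    printf("read %zd bytes in %ld-byte chunks\n", total,
           (long)st.st_blksize);
    free(buf);
    close(fd);
    return 0;
}
```

On 9p, a larger msize lets a larger st_blksize actually translate into fewer, bigger wire requests.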
On Tue, Nov 17, 2020 at 04:00:09PM +0000, Venegas Munoz, Jose Carlos wrote:
> In the case of Kata for 9pfs (this is the config we have found to have
> better performance and fs compatibility in general) we have:
[..]

How much RAM are you giving to these containers when using virtio-9p?

Vivek
On Tue, Nov 17, 2020 at 04:00:09PM +0000, Venegas Munoz, Jose Carlos wrote:
> In Kata we are looking for the best config for fs compatibility and
> performance. So even if it is not apples to apples, we go for the best
> config for both and compare the best that each of them can do.

Can we run tests in more than one configuration? Right now you are using
cache=mmap for virtio-9p and cache=auto for virtiofs, and as Miklos said
this is not an apples-to-apples comparison. So you can continue to run
the above, but also run additional tests if time permits.

virtio-9p       virtio-fs
cache=mmap      cache=none + DAX
cache=none      cache=none
cache=loose     cache=always

Given you are using cache=mmap for virtio-9p, "cache=none + DAX" is
somewhat equivalent to that. It provides strong coherency as well as
allowing mmap() to work.

Kernel virtiofs DAX support got merged in 5.10-rc1. For qemu, you can
use the virtio-fs-dev branch.

https://gitlab.com/virtio-fs/qemu/-/commits/virtio-fs-dev

Thanks
Vivek
On Thu, Nov 12, 2020 at 10:06:37AM +0100, Miklos Szeredi wrote:
[..]
> Hmm, the FUSE_AUTO_INVAL_DATA feature is buggy, how about turning it
> off for now? 9p doesn't have it, so there is no point in enabling it
> for virtiofs by default.

If we disable FUSE_AUTO_INVAL_DATA, then the client page cache will not
be invalidated even after 1 sec, right? (for cache=auto)

Given that we now also want to target sharing a directory tree among
multiple clients, keeping FUSE_AUTO_INVAL_DATA enabled should help.

> Also I think some confusion comes from cache=auto being the default
> for virtiofs. Not sure what the default is for 9p, but comparing
> default to default will definitely not be apples to apples, since this
> mode is nonexistent in 9p.
>
> 9p:cache=none  <-> virtiofs:cache=none
> 9p:cache=loose <-> virtiofs:cache=always
>
> "9p:cache=mmap" and "virtiofs:cache=auto" have no match.

Agreed from a performance comparison point of view.

I will prefer a cache=auto default (over cache=none) for virtiofsd.
During some kernel compilation tests over virtiofs, cache=none was
painfully slow as compared to cache=auto.

Thanks
Vivek
On Thu, Nov 12, 2020 at 10:57:11AM +0100, Miklos Szeredi wrote:

[..]

> Another performance issue with virtiofs could be due to the strict
> page writeback rules in fuse that are meant to prevent misuse of
> kernel memory by unprivileged processes. Since virtiofs isn't
> subject to that limitation, these could be relaxed.

Hi Miklos,

I tried this patch with some of the write mmap workloads and it seems to
help. I ran virtiofsd with the following options:

$ virtiofsd --socket-path=/tmp/vhostqemu -o source=/mnt/virtiofs-source -o no_posix_lock -o cache=auto --thread-pool-size=0 --daemonize

Note, these workloads are not doing any fsync after write, so they are
effectively testing how fast one can do cached writes.

NAME         WORKLOAD               Bandwidth  IOPS
limit-bdi    seqwrite-mmap          201.4mb    50.3k
nolimit-bdi  seqwrite-mmap          253.4mb    63.3k

limit-bdi    seqwrite-mmap-multi    386.7mb    96.6k
nolimit-bdi  seqwrite-mmap-multi    752.2mb    188.0k

limit-bdi    randwrite-mmap         53.5mb     13.3k
nolimit-bdi  randwrite-mmap         60.6mb     15.1k

limit-bdi    randwrite-mmap-multi   206.3mb    51.5k
nolimit-bdi  randwrite-mmap-multi   255.5mb    63.8k

seqwrite-mmap-multi seems to see the biggest jump. So it might be a good
idea to apply this patch.

> Attaching a patch that does one half of this. The other half is
> getting rid of the page copying; that's a bit more involved, but
> shouldn't be too difficult. We just need to duplicate the cached
> writeback callbacks for virtiofs and do away with the complex temp
> page stuff.

Aha, so we don't need all the complexity related to allocating those
temporary pages and then keeping track of writes in an rb tree and
waiting for writes to finish, etc. It could be more like a regular
filesystem, where a lot of this stuff is taken care of by common code
in the case of virtiofs.

That would be really nice. Less code complexity to deal with, and it
might also provide a performance improvement.

Thanks
Vivek
On Wed, Nov 18, 2020 at 8:52 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Thu, Nov 12, 2020 at 10:06:37AM +0100, Miklos Szeredi wrote:
[..]
> > Hmm, the FUSE_AUTO_INVAL_DATA feature is buggy, how about turning it
> > off for now? 9p doesn't have it, so there is no point in enabling it
> > for virtiofs by default.
>
> If we disable FUSE_AUTO_INVAL_DATA, then the client page cache will not
> be invalidated even after 1 sec, right? (for cache=auto)

Unless FOPEN_KEEP_CACHE is used (only cache=always, AFAICS), the data
cache will be invalidated on open. I think that's what NFS does, except
NFS does the invalidation based on the mtime *at the time of open*. At
least that's what I remember from past reading of the NFS code.

> Given that we now also want to target sharing a directory tree among
> multiple clients, keeping FUSE_AUTO_INVAL_DATA enabled should help.

Depends what applications expect. The close-to-open coherency provided
even without FUSE_AUTO_INVAL_DATA should be enough for the case when
strong coherency is not required.

> I will prefer a cache=auto default (over cache=none) for virtiofsd.
> During some kernel compilation tests over virtiofs, cache=none was
> painfully slow as compared to cache=auto.

It would also be interesting to know the exact bottleneck in the
kernel-compile with cache=none case.

Thanks,
Miklos
For all these cases the memory for the guest is 2G.

On 17/11/20 12:55, "Vivek Goyal" <vgoyal@redhat.com> wrote:

    On Tue, Nov 17, 2020 at 04:00:09PM +0000, Venegas Munoz, Jose Carlos wrote:
    > > Not sure what the default is for 9p, but comparing
    > > default to default will definitely not be apples to apples since this
    > > mode is nonexistent in 9p.
    >
    > In Kata we are looking for the best config for fs compatibility and
    > performance. So even if they are not apples to apples, we go for the
    > best config for both and compare the best that each of them can do.
    >
    > In the case of Kata for 9pfs (this is the config we have found to give
    > the best performance and fs compatibility in general) we have:
    > ```
    > -device virtio-9p-pci            # device type
    >   ,disable-modern=false
    >   ,fsdev=extra-9p-kataShared     # attr: device id for fsdev
    >   ,mount_tag=kataShared          # attr: tag by which the shared fs will be found
    >   ,romfile=
    > -fsdev local                     # local: simply lets QEMU call the individual VFS functions (more or less) directly on the host
    >   ,id=extra-9p-kataShared
    >   ,path=${SHARED_PATH}           # attr: path to share
    >   ,security_model=none
    >   # passthrough: files are stored using the same credentials as they are created with on the guest. This requires QEMU to run as root.
    >   # none: same as "passthrough" except the server won't report failures if it fails to set file attributes like ownership
    >   # (chown). This makes a passthrough-like security model usable for people who run kvm as non-root.
    >   ,multidevs=remap
    > ```
    >
    > The mount options are:
    > ```
    > trans=virtio
    > ,version=9p2000.L
    > ,cache=mmap
    > ,"nodev"        # security: the nodev mount option specifies that the filesystem cannot contain special devices
    > ,"msize=8192"   # msize: maximum packet size including any headers
    > ```

    How much RAM are you giving to these containers when using virtio-9p?

    Vivek
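[Editor's note: putting the device and mount halves of this configuration together, the guest-side mount would look roughly like the line below. The mount point path is a placeholder; the device name must match the mount_tag above.]

```
$ mount -t 9p -o trans=virtio,version=9p2000.L,cache=mmap,nodev,msize=8192 kataShared /run/kata-containers/shared
```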
Cool, thanks. I will bring some results this week.

On 17/11/20 16:24, "Vivek Goyal" <vgoyal@redhat.com> wrote:

    On Tue, Nov 17, 2020 at 04:00:09PM +0000, Venegas Munoz, Jose Carlos wrote:
    > > Not sure what the default is for 9p, but comparing
    > > default to default will definitely not be apples to apples since this
    > > mode is nonexistent in 9p.
    >
    > In Kata we are looking for the best config for fs compatibility and
    > performance. So even if they are not apples to apples, we go for the
    > best config for both and compare the best that each of them can do.

    Can we run tests in more than one configuration? Right now you are
    using cache=mmap for virtio-9p and cache=auto for virtiofs, and as
    Miklos said this is not an apples to apples comparison. So you can
    continue to run the above, but also run additional tests if time
    permits.

    virtio-9p       virtio-fs
    cache=mmap      cache=none + DAX
    cache=none      cache=none
    cache=loose     cache=always

    Given you are using cache=mmap for virtio-9p, "cache=none + DAX" is
    somewhat equivalent to that. It provides strong coherency as well as
    allows mmap() to work.

    Kernel virtiofs DAX support got merged in 5.10-rc1. For qemu, you can
    use the virtio-fs-dev branch.

    https://gitlab.com/virtio-fs/qemu/-/commits/virtio-fs-dev

    Thanks
    Vivek
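[Editor's note: for anyone wanting to try the "cache=none + DAX" column, a rough sketch of the pieces involved, assuming the virtio-fs-dev qemu branch above and a 5.10-rc1+ guest kernel. The socket path, tag, mount point, and 2G DAX window size are placeholders.]

```
# virtiofsd: strong coherency, no caching on the host side
$ virtiofsd --socket-path=/tmp/vhostqemu -o source=/mnt/virtiofs-source -o cache=none --daemonize

# qemu: vhost-user-fs device with a DAX window (cache-size) and
# file-backed shared guest memory, which vhost-user requires
$ qemu-system-x86_64 ... \
    -chardev socket,id=char0,path=/tmp/vhostqemu \
    -device vhost-user-fs-pci,chardev=char0,tag=myfs,cache-size=2G \
    -object memory-backend-file,id=mem,size=2G,mem-path=/dev/shm,share=on \
    -numa node,memdev=mem

# guest: mount with the dax option to map files through the window
$ mount -t virtiofs myfs /mnt -o dax
```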
diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
index ff86f6d1ce..60aa7cd3e5 100644
--- a/tools/virtiofsd/fuse_virtio.c
+++ b/tools/virtiofsd/fuse_virtio.c
@@ -695,13 +695,17 @@ static void *fv_queue_thread(void *opaque)
     struct VuDev *dev = &qi->virtio_dev->dev;
     struct VuVirtq *q = vu_get_queue(dev, qi->qidx);
     struct fuse_session *se = qi->virtio_dev->se;
-    GThreadPool *pool;
-
-    pool = g_thread_pool_new(fv_queue_worker, qi, se->thread_pool_size, FALSE,
-                             NULL);
-    if (!pool) {
-        fuse_log(FUSE_LOG_ERR, "%s: g_thread_pool_new failed\n", __func__);
-        return NULL;
+    GThreadPool *pool = NULL;
+    GList *req_list = NULL;
+
+    if (se->thread_pool_size) {
+        fuse_log(FUSE_LOG_DEBUG, "%s: Creating thread pool for Queue %d\n", __func__, qi->qidx);
+        pool = g_thread_pool_new(fv_queue_worker, qi, se->thread_pool_size,
+                                 FALSE, NULL);
+        if (!pool) {
+            fuse_log(FUSE_LOG_ERR, "%s: g_thread_pool_new failed\n", __func__);
+            return NULL;
+        }
     }
 
     fuse_log(FUSE_LOG_INFO, "%s: Start for queue %d kick_fd %d\n", __func__,
@@ -780,14 +784,25 @@
             req->bad_in_num = bad_in_num;
             req->bad_out_num = bad_out_num;
 
-            g_thread_pool_push(pool, req, NULL);
+            if (!se->thread_pool_size)
+                req_list = g_list_prepend(req_list, req);
+            else
+                g_thread_pool_push(pool, req, NULL);
         }
 
         pthread_mutex_unlock(&qi->vq_lock);
         pthread_rwlock_unlock(&qi->virtio_dev->vu_dispatch_rwlock);
+
+        // Process all the requests.
+        if (!se->thread_pool_size && req_list != NULL) {
+            g_list_foreach(req_list, fv_queue_worker, qi);
+            g_list_free(req_list);
+            req_list = NULL;
+        }
     }
 
-    g_thread_pool_free(pool, FALSE, TRUE);
+    if (pool)
+        g_thread_pool_free(pool, FALSE, TRUE);
 
     return NULL;
 }
Right now we create a thread pool and the main thread hands over each
request to a thread in the thread pool to process. The number of
threads in the thread pool can be managed by the option
--thread-pool-size.

There is a chance that for some workloads we might get better
performance if we don't hand over the request to a different thread
and instead process it in the context of the thread receiving it.

To implement that, redefine the meaning of --thread-pool-size=0 to
mean don't use a thread pool. Instead, process the request in the
context of the thread receiving it from the queue.

I can't think how --thread-pool-size=0 is useful as it stands, hence
I am repurposing it. If it is already useful somehow, I could look at
defining a new option, say "--no-thread-pool".

I think this patch will be used more as a debugging aid, to compare
when it is more efficient not to hand the requests over to a thread
pool.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 tools/virtiofsd/fuse_virtio.c | 33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)