Message ID: 20200924131318.2654747-1-balsini@android.com (mailing list archive)
Series: fuse: Add support for passthrough read/write
On Thu, Sep 24, 2020 at 3:13 PM Alessio Balsini <balsini@android.com> wrote:

> The first benchmarks were done by running FIO (fio-3.21) with:
> - bs=4Ki;
> - file size: 50Gi;
> - ioengine: sync;
> - fsync_on_close: true.
> The target file has been chosen large enough to avoid it to be entirely
> loaded into the page cache.
> Results are presented in the following table:
>
> +-----------+--------+-------------+--------+
> | Bandwidth |  FUSE  |    FUSE     |  Bind  |
> | (KiB/s)   |        | passthrough | mount  |
> +-----------+--------+-------------+--------+
> | read      | 468897 | 502085      | 516830 |
> +-----------+--------+-------------+--------+
> | randread  |  15773 |  26632      |  21386 |

Have you looked into why passthrough is faster than native?

Thanks,
Miklos
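
For reference, the configuration quoted above corresponds roughly to the
fio invocation sketched below; the job name, target directory and rw
mode are illustrative placeholders, not values taken from the thread.

  # Sketch of the quoted setup (fio-3.21): 4 KiB blocks, 50 GiB file,
  # synchronous I/O engine, fsync on close; repeat with --rw=randread
  # for the random-read case. --directory and --name are placeholders.
  fio --name=fuse-bench --directory=/mnt/fuse \
      --rw=read --bs=4k --size=50g \
      --ioengine=sync --fsync_on_close=1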
On Wed, Sep 30, 2020 at 05:33:30PM +0200, Miklos Szeredi wrote:
> On Thu, Sep 24, 2020 at 3:13 PM Alessio Balsini <balsini@android.com> wrote:
>
> > The first benchmarks were done by running FIO (fio-3.21) with:
> > - bs=4Ki;
> > - file size: 50Gi;
> > - ioengine: sync;
> > - fsync_on_close: true.
> > The target file has been chosen large enough to avoid it to be entirely
> > loaded into the page cache.
> > Results are presented in the following table:
> >
> > +-----------+--------+-------------+--------+
> > | Bandwidth |  FUSE  |    FUSE     |  Bind  |
> > | (KiB/s)   |        | passthrough | mount  |
> > +-----------+--------+-------------+--------+
> > | read      | 468897 | 502085      | 516830 |
> > +-----------+--------+-------------+--------+
> > | randread  |  15773 |  26632      |  21386 |
>
> Have you looked into why passthrough is faster than native?
>
> Thanks,
> Miklos

Hi Miklos,

Thank you for bringing this to my attention; I probably missed it
because I was focusing on the comparison between FUSE and FUSE
passthrough. I jumped back to benchmarking right after you sent this
email.

At first glance I thought I had made a stupid copy-paste mistake, but
looking at a bunch of partial results I'm collecting, I realized that
the Vi550 S3 SSD I'm using sometimes has unstable performance,
especially when dealing with random offsets. I also realized that SSD
performance might change depending on previous operations. To address
this, each test is now being run 10 times, and at post-processing time
I'm planning to take the median to remove possible outliers.

I also noticed that the performance noise increases after the SSD has
been busy for a few minutes. This made me think of some kind of SSD
thermal throttling I had totally overlooked, which might be the reason
why passthrough performed better than native in the numbers you
highlighted. Unfortunately the SMART registers of my SSD always report
33 degrees Celsius regardless of the workload, so to work around this
I'm now applying a 5-minute cooldown between runs.

This time I'm also removing fsync_on_close and reducing the file size
to 25 GiB to improve caching and limit the interaction with the SSD
during writes. Still for caching reasons, I'm also separating the
creation of the fio target file from the actual execution of the
benchmark by first running fio with create_only=1.

In the benchmark above I was just sync-ing and dropping the page cache
before triggering fio; I now also drop slab objects, including inodes
and dentries, with:

  echo 3 > /proc/sys/vm/drop_caches

which I suspect won't make any difference, but won't hurt either.

Please let me know if you have any suggestions on how to improve my
benchmarks, or if you recommend tools other than fio (which I actually
really like) for making comparisons.

Thanks,
Alessio
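
A minimal sketch of the revised procedure described above (create_only
dry run, repeated runs with cache drops and a cooldown, median taken at
post-processing time); the 10 runs, the 25 GiB size and the 5-minute
cooldown come from the email, while the job name and target directory
are placeholders.

  #!/bin/sh
  # Dry run: only lay out the target file, so the measured runs do not
  # include file creation (fsync_on_close dropped, size reduced to 25 GiB).
  fio --name=fuse-bench --directory=/mnt/fuse --rw=read --bs=4k \
      --size=25g --ioengine=sync --create_only=1

  for i in $(seq 1 10); do
      sync
      echo 3 > /proc/sys/vm/drop_caches     # page cache + inodes/dentries
      fio --name=fuse-bench --directory=/mnt/fuse --rw=read --bs=4k \
          --size=25g --ioengine=sync \
          --output=run-$i.json --output-format=json
      sleep 300                             # 5-minute cooldown
  done
  # Post-processing: take the median bandwidth over the 10 runs to
  # filter out outliers.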
Hi Miklos, all,

After being stuck with some strange and hard-to-reproduce results from
my SSD, I finally decided to overcome the biggest chunk of
inconsistencies by forgetting about the SSD and switching to a RAM
block device to host my lower file system. Getting rid of the discrete
storage device removes a huge component of slowness, highlighting the
performance difference of the software parts (and probably the goodness
of the CPU cache and its coherence/invalidation mechanisms).

More specifically, out of my system's 32 GiB of RAM, I reserved 24 for
/dev/ram0, which has been formatted as ext4. That file system has been
completely filled and then cleaned up before running the benchmarks, to
make sure all the memory addresses were marked as used and removed from
the page cache.

As last time, I've been using a slightly modified libfuse
passthrough_hp.cc example that simply enables the passthrough mode at
every open/create operation:

  git@github.com:balsini/libfuse fuse-passthrough-stable-v.3.9.4

The following tests were run using fio-3.23 with the following
configuration:
- bs=4Ki
- size=20Gi
- ioengine=sync
- fsync_on_close=1
- randseed=0
- create_only=0 (set to 1 during a first dry run to create the test file)

With this configuration, each benchmark performs a single open
operation, focusing on just the read/write performance. The file size
of 20 GiB has been chosen so that the file does not completely fit in
the page cache. As mentioned in my previous email, all the caches were
dropped before running every benchmark with

  echo 3 > /proc/sys/vm/drop_caches

All the benchmarks were run 10 times, with a 1-minute cooldown between
runs.

Here are the updated results for this patch set:

+-----------+-------------+-------------+-------------+
|           |             |    FUSE     |             |
|   MiB/s   |    FUSE     | passthrough |   native    |
+-----------+-------------+-------------+-------------+
| read      | 1341(±4.2%) | 1485(±1.1%) | 1634(±.5%)  |
+-----------+-------------+-------------+-------------+
| write     |   49(±2.1%) | 1304(±2.6%) | 1363(±3.0%) |
+-----------+-------------+-------------+-------------+
| randread  |   43(±1.3%) | 643(±11.1%) |  715(±1.1%) |
+-----------+-------------+-------------+-------------+
| randwrite |  27(±39.9%) |  763(±1.1%) |  790(±1.0%) |
+-----------+-------------+-------------+-------------+

This table shows that FUSE, except for sequential reads, lags behind
both FUSE passthrough and native performance. The extremely good FUSE
performance for sequential reads is the result of an effective
read-ahead mechanism, which was easy to confirm: performance dropped
after setting read_ahead_kb to 0.

Except for FUSE randwrite and passthrough randread, with ~40% and ~11%
standard deviations respectively, all the other results are relatively
stable. Nevertheless, these two exceptions are not sufficient to
invalidate the results, which still show clear performance benefits.
I'm also kind of happy to see that passthrough, which for each
read/write operation traverses the VFS layer twice, now consistently
shows slightly lower performance than native.

I wanted to make sure the results were consistent before jumping back
to your feedback on the series.

Thanks,
Alessio
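
A sketch of the RAM-disk setup and updated fio configuration described
above; the 24 GiB device size and the fio parameters come from the
email, while the use of the brd module, the mount points and the job
name are assumptions.

  # 24 GiB ram block device (assuming the brd module; rd_size is in KiB),
  # formatted as ext4 and used to host the lower file system.
  modprobe brd rd_nr=1 rd_size=$((24 * 1024 * 1024))
  mkfs.ext4 /dev/ram0
  mount /dev/ram0 /mnt/lower                # placeholder mount point

  # Fill the file system completely, then clean it up, so every block
  # has been written and nothing of it is left in the page cache.
  dd if=/dev/zero of=/mnt/lower/fill bs=1M; rm /mnt/lower/fill
  sync; echo 3 > /proc/sys/vm/drop_caches

  # Updated fio configuration (fio-3.23): run once with --create_only=1,
  # then once per rw mode (read, write, randread, randwrite), against
  # the FUSE mount point or the lower file system depending on the case.
  fio --name=fuse-bench --directory=/mnt/fuse --rw=randread --bs=4k \
      --size=20g --ioengine=sync --fsync_on_close=1 --randseed=0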