Message ID | cover.1601495426.git.josef@toxicpanda.com |
---|---|
Series | Improve preemptive ENOSPC flushing |
On 30.09.20 at 23:01, Josef Bacik wrote:
> Hello,
>
> A while ago Nikolay started digging into a problem where they were seeing
> a roughly 20% regression on random writes, and he bisected it down to
>
>   btrfs: don't end the transaction for delayed refs in throttle
>
> However, this wasn't actually the cause of the problem.
>
> This patch removed the code that would preemptively end the transactions
> if we were low on space. Because we had just introduced the ticketing
> code, this was no longer necessary and was causing a lot of transaction
> commits.
>
> In Nikolay's testing he validated this: we would see roughly 100x more
> transaction commits without that patch than with it, but the write
> regression clearly appeared when this patch was applied.
>
> The root cause is that the transaction commits were happening so quickly
> that we didn't end up needing to wait on space in the ENOSPC ticketing
> code as much, and thus were able to write pretty quickly. With this gone,
> we instead got a sawtooth sort of behavior where we'd run up, stop while
> we flushed metadata space, run some more, stop again, etc.
>
> When I implemented the ticketing infrastructure, I was trying to get us
> out of excessively flushing space because we would sometimes over-create
> block groups, and thus I short-circuited flushing if we no longer had
> tickets. This had the side effect of breaking the preemptive flushing
> code, where we attempted to flush space in the background before we were
> forced to wait for space.
>
> Enter this patchset. We still have some of this preemption logic
> sprinkled everywhere, so I've separated it out of the normal ticketed
> flushing code and made preemptive flushing its own thing.
>
> The preemptive flushing logic is more specialized than the standard
> flushing logic. It attempts to flush whichever pool has the highest
> usage. This means that if most of our space is tied up in pinned extents,
> we'll commit the transaction. If most of the space is tied up in
> delalloc, we'll flush delalloc, etc.
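
As a rough illustration of that last point, here is a standalone sketch
(made-up names, not the actual fs/btrfs/space-info.c code) of picking a
flush target based on whichever pool currently holds the most space:

/*
 * Toy model of the preemptive flush selection described above:
 * commit the transaction when pinned bytes dominate (a commit is
 * what releases them), otherwise flush the larger of delalloc and
 * delayed refs.  Names and thresholds are illustrative only.
 */
#include <stdint.h>
#include <stdio.h>

enum flush_target { FLUSH_DELALLOC, FLUSH_DELAYED_REFS, COMMIT_TRANS };

struct pool_usage {
	uint64_t delalloc_bytes;	/* dirty data awaiting writeback */
	uint64_t delayed_ref_bytes;	/* metadata held by delayed refs */
	uint64_t pinned_bytes;		/* freed extents pinned until commit */
};

static enum flush_target pick_flush_target(const struct pool_usage *p)
{
	if (p->pinned_bytes >= p->delalloc_bytes &&
	    p->pinned_bytes >= p->delayed_ref_bytes)
		return COMMIT_TRANS;
	if (p->delalloc_bytes >= p->delayed_ref_bytes)
		return FLUSH_DELALLOC;
	return FLUSH_DELAYED_REFS;
}

int main(void)
{
	struct pool_usage p = {
		.delalloc_bytes = 8ULL << 20,
		.delayed_ref_bytes = 1ULL << 20,
		.pinned_bytes = 2ULL << 20,
	};

	/* Most space is tied up in delalloc, so we'd flush delalloc. */
	printf("target: %d\n", pick_flush_target(&p));
	return 0;
}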
> To test this out I used the fio job that Nikolay used. This needs to be
> adjusted so the overall IO size is at least 2x the RAM size of the box
> you are testing on:
>
> fio --direct=0 --ioengine=sync --thread --directory=/mnt/test --invalidate=1 \
>     --group_reporting=1 --runtime=300 --fallocate=none --ramp_time=10 \
>     --name=RandomWrites-async-64512-4k-4 --new_group --rw=randwrite \
>     --size=2g --numjobs=4 --bs=4k --fsync_on_close=0 --end_fsync=0 \
>     --filename_format=FioWorkloads.\$jobnum
>
> I got the following results:
>
> misc-next:      bw=13.4MiB/s (14.0MB/s), 13.4MiB/s-13.4MiB/s (14.0MB/s-14.0MB/s), io=4015MiB (4210MB), run=300323-300323msec
> pre-throttling: bw=16.9MiB/s (17.7MB/s), 16.9MiB/s-16.9MiB/s (17.7MB/s-17.7MB/s), io=5068MiB (5314MB), run=300069-300069msec
> my patches:     bw=18.0MiB/s (18.9MB/s), 18.0MiB/s-18.0MiB/s (18.9MB/s-18.9MB/s), io=5403MiB (5666MB), run=300001-300001msec
>
> Thanks,
>
> Josef
>
> Josef Bacik (9):
>   btrfs: add a trace point for reserve tickets
>   btrfs: improve preemptive background space flushing
>   btrfs: rename need_do_async_reclaim
>   btrfs: check reclaim_size in need_preemptive_reclaim
>   btrfs: rework btrfs_calc_reclaim_metadata_size
>   btrfs: simplify the logic in need_preemptive_flushing
>   btrfs: implement space clamping for preemptive flushing
>   btrfs: adjust the flush trace point to include the source
>   btrfs: add a trace class for dumping the current ENOSPC state
>
>  fs/btrfs/ctree.h             |   1 +
>  fs/btrfs/disk-io.c           |   1 +
>  fs/btrfs/space-info.c        | 216 +++++++++++++++++++++++++++++------
>  fs/btrfs/space-info.h        |   3 +
>  include/trace/events/btrfs.h | 112 +++++++++++++++++-
>  5 files changed, 292 insertions(+), 41 deletions(-)

Here are my results from testing misc-next/your patches/4.19 for buffered IO:

With your patches:

WRITE: bw=29.1MiB/s (30.5MB/s), 29.1MiB/s-29.1MiB/s (30.5MB/s-30.5MB/s), io=8192MiB (8590MB), run=281678-281678msec
WRITE: bw=31.0MiB/s (32.5MB/s), 31.0MiB/s-31.0MiB/s (32.5MB/s-32.5MB/s), io=8192MiB (8590MB), run=264337-264337msec
WRITE: bw=29.6MiB/s (31.1MB/s), 29.6MiB/s-29.6MiB/s (31.1MB/s-31.1MB/s), io=8192MiB (8590MB), run=276312-276312msec
WRITE: bw=29.8MiB/s (31.2MB/s), 29.8MiB/s-29.8MiB/s (31.2MB/s-31.2MB/s), io=8192MiB (8590MB), run=274916-274916msec
WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=8192MiB (8590MB), run=269030-269030msec

misc-next (without your patches):

WRITE: bw=20.2MiB/s (21.2MB/s), 20.2MiB/s-20.2MiB/s (21.2MB/s-21.2MB/s), io=8192MiB (8590MB), run=404831-404831msec
WRITE: bw=20.8MiB/s (21.8MB/s), 20.8MiB/s-20.8MiB/s (21.8MB/s-21.8MB/s), io=8192MiB (8590MB), run=394749-394749msec
WRITE: bw=20.8MiB/s (21.8MB/s), 20.8MiB/s-20.8MiB/s (21.8MB/s-21.8MB/s), io=8192MiB (8590MB), run=393291-393291msec
WRITE: bw=20.7MiB/s (21.8MB/s), 20.7MiB/s-20.7MiB/s (21.8MB/s-21.8MB/s), io=8192MiB (8590MB), run=394918-394918msec
WRITE: bw=21.1MiB/s (22.1MB/s), 21.1MiB/s-21.1MiB/s (22.1MB/s-22.1MB/s), io=8192MiB (8590MB), run=388499-388499msec

4.19.x:

WRITE: bw=23.3MiB/s (24.4MB/s), 23.3MiB/s-23.3MiB/s (24.4MB/s-24.4MB/s), io=6387MiB (6697MB), run=274460-274460msec
WRITE: bw=23.3MiB/s (24.5MB/s), 23.3MiB/s-23.3MiB/s (24.5MB/s-24.5MB/s), io=6643MiB (6966MB), run=284518-284518msec
WRITE: bw=23.4MiB/s (24.5MB/s), 23.4MiB/s-23.4MiB/s (24.5MB/s-24.5MB/s), io=6643MiB (6966MB), run=284372-284372msec
WRITE: bw=23.6MiB/s (24.7MB/s), 23.6MiB/s-23.6MiB/s (24.7MB/s-24.7MB/s), io=6387MiB (6697MB), run=271200-271200msec
WRITE: bw=23.4MiB/s (24.6MB/s), 23.4MiB/s-23.4MiB/s (24.6MB/s-24.6MB/s), io=6387MiB (6697MB), run=272670-272670msec
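
Averaging the five runs gives (29.1 + 31.0 + 29.6 + 29.8 + 30.4) / 5 ≈ 30.0
MiB/s with your patches, vs (20.2 + 20.8 + 20.8 + 20.7 + 21.1) / 5 ≈ 20.7
MiB/s on plain misc-next and ≈ 23.4 MiB/s on 4.19, i.e. roughly a 45%
improvement over misc-next and 28% over 4.19.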
So there is indeed a net gain in performance for buffered IO. However,
expectedly, due to the increased number of transaction commits, there is a
regression in DIO (roughly 46.8 vs 50.3 MiB/s averaged over the five runs,
about 7% slower):

DIO-josef:

WRITE: bw=47.1MiB/s (49.4MB/s), 47.1MiB/s-47.1MiB/s (49.4MB/s-49.4MB/s), io=8192MiB (8590MB), run=174049-174049msec
WRITE: bw=48.5MiB/s (50.8MB/s), 48.5MiB/s-48.5MiB/s (50.8MB/s-50.8MB/s), io=8192MiB (8590MB), run=169045-169045msec
WRITE: bw=46.0MiB/s (48.2MB/s), 46.0MiB/s-46.0MiB/s (48.2MB/s-48.2MB/s), io=8192MiB (8590MB), run=178196-178196msec
WRITE: bw=46.1MiB/s (48.3MB/s), 46.1MiB/s-46.1MiB/s (48.3MB/s-48.3MB/s), io=8192MiB (8590MB), run=177861-177861msec
WRITE: bw=46.4MiB/s (48.7MB/s), 46.4MiB/s-46.4MiB/s (48.7MB/s-48.7MB/s), io=8192MiB (8590MB), run=176376-176376msec

DIO-misc-next:

WRITE: bw=50.1MiB/s (52.6MB/s), 50.1MiB/s-50.1MiB/s (52.6MB/s-52.6MB/s), io=8192MiB (8590MB), run=163365-163365msec
WRITE: bw=50.3MiB/s (52.8MB/s), 50.3MiB/s-50.3MiB/s (52.8MB/s-52.8MB/s), io=8192MiB (8590MB), run=162753-162753msec
WRITE: bw=50.6MiB/s (53.1MB/s), 50.6MiB/s-50.6MiB/s (53.1MB/s-53.1MB/s), io=8192MiB (8590MB), run=161766-161766msec
WRITE: bw=50.2MiB/s (52.7MB/s), 50.2MiB/s-50.2MiB/s (52.7MB/s-52.7MB/s), io=8192MiB (8590MB), run=163074-163074msec
WRITE: bw=50.5MiB/s (52.9MB/s), 50.5MiB/s-50.5MiB/s (52.9MB/s-52.9MB/s), io=8192MiB (8590MB), run=162252-162252msec

With that:

Tested-by: Nikolay Borisov <nborisov@suse.com>