mbox series

[0/9] Improve preemptive ENOSPC flushing

Message ID cover.1601495426.git.josef@toxicpanda.com (mailing list archive)
Headers show
Series Improve preemptive ENOSPC flushing | expand

Message

Josef Bacik Sept. 30, 2020, 8:01 p.m. UTC
Hello,

A while ago Nikolay started digging into a problem where they were seeing an
around 20% regression on random writes, and he bisected it down to

  btrfs: don't end the transaction for delayed refs in throttle

However this wasn't actually the cause of the problem.

This patch removed the code that would preemptively end the transactions if we
were low on space.  Because we had just introduced the ticketing code, this was
no longer necessary and was causing a lot of transaction commits.

And in Nikolay's testing he validated this, we would see like 100x more
transaction commits without that patch than with it, but the write regression
clearly appeared when this patch was applied.

The root cause of this is that the transaction commits were essentially
happening so quickly that we didn't end up needing to wait on space in the
ENOSPC ticketing code as much, and thus were able to write pretty quickly.  With
this gone, we now were getting a sawtoothy sort of behavior where we'd run up,
stop while we flushed metadata space, run some more, stop again etc.

When I implemented the ticketing infrastructure, I was trying to get us out of
excessively flushing space because we would sometimes over create block groups,
and thus short circuited flushing if we no longer had tickets.  This had the
side effect of breaking the preemptive flushing code, where we attempted to
flush space in the background before we were forced to wait for space.

Enter this patchset.  We still have some of this preemption logic sprinkled
everywhere, so I've separated it out of the normal ticketed flushing code, and
made preemptive flushing it's own thing.

The preemptive flushing logic is more specialized than the standard flushing
logic.  It attempts to flush in whichever pool has the highest usage.  This
means that if most of our space is tied up in pinned extents, we'll commit the
transaction.  If most of the space is tied up in delalloc, we'll flush delalloc,
etc.

To test this out I used the fio job that Nikolay used, this needs to be adjusted
so the overall IO size is at least 2x the RAM size for the box you are testing

fio --direct=0 --ioengine=sync --thread --directory=/mnt/test --invalidate=1 \
        --group_reporting=1 --runtime=300 --fallocate=none --ramp_time=10 \
        --name=RandomWrites-async-64512-4k-4 --new_group --rw=randwrite \
        --size=2g --numjobs=4 --bs=4k --fsync_on_close=0 --end_fsync=0 \
        --filename_format=FioWorkloads.\$jobnum

I got the following results

misc-next:	bw=13.4MiB/s (14.0MB/s), 13.4MiB/s-13.4MiB/s (14.0MB/s-14.0MB/s), io=4015MiB (4210MB), run=300323-300323msec
pre-throttling:	bw=16.9MiB/s (17.7MB/s), 16.9MiB/s-16.9MiB/s (17.7MB/s-17.7MB/s), io=5068MiB (5314MB), run=300069-300069msec
my patches:	bw=18.0MiB/s (18.9MB/s), 18.0MiB/s-18.0MiB/s (18.9MB/s-18.9MB/s), io=5403MiB (5666MB), run=300001-300001msec

Thanks,

Josef

Josef Bacik (9):
  btrfs: add a trace point for reserve tickets
  btrfs: improve preemptive background space flushing
  btrfs: rename need_do_async_reclaim
  btrfs: check reclaim_size in need_preemptive_reclaim
  btrfs: rework btrfs_calc_reclaim_metadata_size
  btrfs: simplify the logic in need_preemptive_flushing
  btrfs: implement space clamping for preemptive flushing
  btrfs: adjust the flush trace point to include the source
  btrfs: add a trace class for dumping the current ENOSPC state

 fs/btrfs/ctree.h             |   1 +
 fs/btrfs/disk-io.c           |   1 +
 fs/btrfs/space-info.c        | 216 +++++++++++++++++++++++++++++------
 fs/btrfs/space-info.h        |   3 +
 include/trace/events/btrfs.h | 112 +++++++++++++++++-
 5 files changed, 292 insertions(+), 41 deletions(-)

Comments

Nikolay Borisov Oct. 6, 2020, 12:55 p.m. UTC | #1
On 30.09.20 г. 23:01 ч., Josef Bacik wrote:
> Hello,
> 
> A while ago Nikolay started digging into a problem where they were seeing an
> around 20% regression on random writes, and he bisected it down to
> 
>   btrfs: don't end the transaction for delayed refs in throttle
> 
> However this wasn't actually the cause of the problem.
> 
> This patch removed the code that would preemptively end the transactions if we
> were low on space.  Because we had just introduced the ticketing code, this was
> no longer necessary and was causing a lot of transaction commits.
> 
> And in Nikolay's testing he validated this, we would see like 100x more
> transaction commits without that patch than with it, but the write regression
> clearly appeared when this patch was applied.
> 
> The root cause of this is that the transaction commits were essentially
> happening so quickly that we didn't end up needing to wait on space in the
> ENOSPC ticketing code as much, and thus were able to write pretty quickly.  With
> this gone, we now were getting a sawtoothy sort of behavior where we'd run up,
> stop while we flushed metadata space, run some more, stop again etc.
> 
> When I implemented the ticketing infrastructure, I was trying to get us out of
> excessively flushing space because we would sometimes over create block groups,
> and thus short circuited flushing if we no longer had tickets.  This had the
> side effect of breaking the preemptive flushing code, where we attempted to
> flush space in the background before we were forced to wait for space.
> 
> Enter this patchset.  We still have some of this preemption logic sprinkled
> everywhere, so I've separated it out of the normal ticketed flushing code, and
> made preemptive flushing it's own thing.
> 
> The preemptive flushing logic is more specialized than the standard flushing
> logic.  It attempts to flush in whichever pool has the highest usage.  This
> means that if most of our space is tied up in pinned extents, we'll commit the
> transaction.  If most of the space is tied up in delalloc, we'll flush delalloc,
> etc.
> 
> To test this out I used the fio job that Nikolay used, this needs to be adjusted
> so the overall IO size is at least 2x the RAM size for the box you are testing
> 
> fio --direct=0 --ioengine=sync --thread --directory=/mnt/test --invalidate=1 \
>         --group_reporting=1 --runtime=300 --fallocate=none --ramp_time=10 \
>         --name=RandomWrites-async-64512-4k-4 --new_group --rw=randwrite \
>         --size=2g --numjobs=4 --bs=4k --fsync_on_close=0 --end_fsync=0 \
>         --filename_format=FioWorkloads.\$jobnum
> 
> I got the following results
> 
> misc-next:	bw=13.4MiB/s (14.0MB/s), 13.4MiB/s-13.4MiB/s (14.0MB/s-14.0MB/s), io=4015MiB (4210MB), run=300323-300323msec
> pre-throttling:	bw=16.9MiB/s (17.7MB/s), 16.9MiB/s-16.9MiB/s (17.7MB/s-17.7MB/s), io=5068MiB (5314MB), run=300069-300069msec
> my patches:	bw=18.0MiB/s (18.9MB/s), 18.0MiB/s-18.0MiB/s (18.9MB/s-18.9MB/s), io=5403MiB (5666MB), run=300001-300001msec
> 
> Thanks,
> 
> Josef
> 
> Josef Bacik (9):
>   btrfs: add a trace point for reserve tickets
>   btrfs: improve preemptive background space flushing
>   btrfs: rename need_do_async_reclaim
>   btrfs: check reclaim_size in need_preemptive_reclaim
>   btrfs: rework btrfs_calc_reclaim_metadata_size
>   btrfs: simplify the logic in need_preemptive_flushing
>   btrfs: implement space clamping for preemptive flushing
>   btrfs: adjust the flush trace point to include the source
>   btrfs: add a trace class for dumping the current ENOSPC state
> 
>  fs/btrfs/ctree.h             |   1 +
>  fs/btrfs/disk-io.c           |   1 +
>  fs/btrfs/space-info.c        | 216 +++++++++++++++++++++++++++++------
>  fs/btrfs/space-info.h        |   3 +
>  include/trace/events/btrfs.h | 112 +++++++++++++++++-
>  5 files changed, 292 insertions(+), 41 deletions(-)
> 



Here are my results from testing misc-next/your patches/4.19 for buffered io: 

With your patches:
 WRITE: bw=29.1MiB/s (30.5MB/s), 29.1MiB/s-29.1MiB/s (30.5MB/s-30.5MB/s), io=8192MiB (8590MB), run=281678-281678msec
 WRITE: bw=30.0MiB/s (32.5MB/s), 30.0MiB/s-30.0MiB/s (32.5MB/s-32.5MB/s), io=8192MiB (8590MB), run=264337-264337msec
 WRITE: bw=29.6MiB/s (31.1MB/s), 29.6MiB/s-29.6MiB/s (31.1MB/s-31.1MB/s), io=8192MiB (8590MB), run=276312-276312msec
 WRITE: bw=29.8MiB/s (31.2MB/s), 29.8MiB/s-29.8MiB/s (31.2MB/s-31.2MB/s), io=8192MiB (8590MB), run=274916-274916msec
 WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=8192MiB (8590MB), run=269030-269030msec

Without-misc-next:
  WRITE: bw=20.2MiB/s (21.2MB/s), 20.2MiB/s-20.2MiB/s (21.2MB/s-21.2MB/s), io=8192MiB (8590MB), run=404831-404831msec
  WRITE: bw=20.8MiB/s (21.8MB/s), 20.8MiB/s-20.8MiB/s (21.8MB/s-21.8MB/s), io=8192MiB (8590MB), run=394749-394749msec
  WRITE: bw=20.8MiB/s (21.8MB/s), 20.8MiB/s-20.8MiB/s (21.8MB/s-21.8MB/s), io=8192MiB (8590MB), run=393291-393291msec
  WRITE: bw=20.7MiB/s (21.8MB/s), 20.7MiB/s-20.7MiB/s (21.8MB/s-21.8MB/s), io=8192MiB (8590MB), run=394918-394918msec
  WRITE: bw=21.1MiB/s (22.1MB/s), 21.1MiB/s-21.1MiB/s (22.1MB/s-22.1MB/s), io=8192MiB (8590MB), run=388499-388499msec

4.19.x:
  WRITE: bw=23.3MiB/s (24.4MB/s), 23.3MiB/s-23.3MiB/s (24.4MB/s-24.4MB/s), io=6387MiB (6697MB), run=274460-274460msec
  WRITE: bw=23.3MiB/s (24.5MB/s), 23.3MiB/s-23.3MiB/s (24.5MB/s-24.5MB/s), io=6643MiB (6966MB), run=284518-284518msec
  WRITE: bw=23.4MiB/s (24.5MB/s), 23.4MiB/s-23.4MiB/s (24.5MB/s-24.5MB/s), io=6643MiB (6966MB), run=284372-284372msec
  WRITE: bw=23.6MiB/s (24.7MB/s), 23.6MiB/s-23.6MiB/s (24.7MB/s-24.7MB/s), io=6387MiB (6697MB), run=271200-271200msec
  WRITE: bw=23.4MiB/s (24.6MB/s), 23.4MiB/s-23.4MiB/s (24.6MB/s-24.6MB/s), io=6387MiB (6697MB), run=272670-272670msec

So there is indeed a net-gain in performance, however, (expectedly, due to the increased number of transaction commits) there is a regressions in DIO: 


DIO-josef:
  WRITE: bw=47.1MiB/s (49.4MB/s), 47.1MiB/s-47.1MiB/s (49.4MB/s-49.4MB/s), io=8192MiB (8590MB), run=174049-174049msec
  WRITE: bw=48.5MiB/s (50.8MB/s), 48.5MiB/s-48.5MiB/s (50.8MB/s-50.8MB/s), io=8192MiB (8590MB), run=169045-169045msec
  WRITE: bw=45.0MiB/s (48.2MB/s), 45.0MiB/s-45.0MiB/s (48.2MB/s-48.2MB/s), io=8192MiB (8590MB), run=178196-178196msec
  WRITE: bw=46.1MiB/s (48.3MB/s), 46.1MiB/s-46.1MiB/s (48.3MB/s-48.3MB/s), io=8192MiB (8590MB), run=177861-177861msec
  WRITE: bw=46.4MiB/s (48.7MB/s), 46.4MiB/s-46.4MiB/s (48.7MB/s-48.7MB/s), io=8192MiB (8590MB), run=176376-176376msec

DIO-misc-next:
  WRITE: bw=50.1MiB/s (52.6MB/s), 50.1MiB/s-50.1MiB/s (52.6MB/s-52.6MB/s), io=8192MiB (8590MB), run=163365-163365msec
  WRITE: bw=50.3MiB/s (52.8MB/s), 50.3MiB/s-50.3MiB/s (52.8MB/s-52.8MB/s), io=8192MiB (8590MB), run=162753-162753msec
  WRITE: bw=50.6MiB/s (53.1MB/s), 50.6MiB/s-50.6MiB/s (53.1MB/s-53.1MB/s), io=8192MiB (8590MB), run=161766-161766msec
  WRITE: bw=50.2MiB/s (52.7MB/s), 50.2MiB/s-50.2MiB/s (52.7MB/s-52.7MB/s), io=8192MiB (8590MB), run=163074-163074msec
  WRITE: bw=50.5MiB/s (52.9MB/s), 50.5MiB/s-50.5MiB/s (52.9MB/s-52.9MB/s), io=8192MiB (8590MB), run=162252-162252msec


With that: 

Tested-by: Nikolay Borisov <nborisov@suse.com>