
[RFC,00/19] btrfs: async discard support

Message ID cover.1570479299.git.dennis@kernel.org

Message

Dennis Zhou Oct. 7, 2019, 8:17 p.m. UTC
Hello,

Discard is an operation that allows the filesystem to tell the
underlying SSD that an LBA region is no longer needed. This gives the
drive more information as it tries to manage the available free space
to minimize write amplification. However, discard hasn't been given
much TLC. Discard is a problematic command because a drive's physical
block management is more or less a black box to us, and the effects of
any particular discard aren't necessarily limited to the lifetime of
the command.

Currently, btrfs handles discarding synchronously during transaction
commit. This can problematically delay transaction commit, depending on
the amount of space that needs to be trimmed and the efficacy of the
discard operation on a particular drive.

This series introduces async discard, which removes discard from the
transaction commit path. While every SSD is free to implement trim
support differently, we strive here to do the right thing. The idea
hinges on recognizing that write amplification really only kicks in
once we're low on free space. As long as we trim enough to keep a large
enough pool of free space, in theory this should minimize the cost of
issuing discards on a workload and have limited overhead in write
amplification.

With async discard, we try to emphasize discarding larger regions
and reusing the LBA (implicit discard). The first is done by using the
free space cache to maintain discard state, which lets us get
coalescing fairly cheaply. A background workqueue is used to scan over
an LRU-kept list of the block groups. It then uses filters to determine
what to discard next, giving priority to larger discards. While
reusing an LBA isn't explicitly attempted, it happens implicitly via
find_free_extent(): if it happens to find a dirty extent, we get to
reuse the LBA. Additionally, async discard skips metadata
block groups, as these should see fairly high turnover since btrfs is a
self-packing filesystem that is stingy about allocating new block groups
until necessary.
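
Concretely, the background worker has roughly the following shape. This
is a simplified sketch only: btrfs_discard_ctl and the discard_* fields
are illustrative stand-ins rather than the exact code in these patches;
btrfs_trim_block_group() is the existing trim helper used by fitrim.

  /* Hypothetical control structure, for illustration only. */
  struct btrfs_discard_ctl {
      struct workqueue_struct *wq;
      struct delayed_work work;
      spinlock_t lock;
      struct list_head discard_list;  /* block groups in LRU order */
      u64 min_discard_size;           /* size filter for this pass */
      unsigned long delay;            /* jiffies between passes */
  };

  static void btrfs_discard_workfn(struct work_struct *work)
  {
      struct btrfs_discard_ctl *dctl =
          container_of(work, struct btrfs_discard_ctl, work.work);
      struct btrfs_block_group_cache *block_group;
      u64 trimmed = 0;

      /* Take the block group that has waited the longest (LRU order). */
      spin_lock(&dctl->lock);
      block_group = list_first_entry_or_null(&dctl->discard_list,
              struct btrfs_block_group_cache, discard_list);
      spin_unlock(&dctl->lock);
      if (!block_group)
          return;

      /*
       * Filter: only issue discards at least min_discard_size long, so
       * larger regions go out first. discard_cursor/discard_end are
       * hypothetical per-block-group fields tracking progress.
       */
      btrfs_trim_block_group(block_group, &trimmed,
                             block_group->discard_cursor,
                             block_group->discard_end,
                             dctl->min_discard_size);

      /* Re-arm the worker with a delay derived from the rate limits. */
      queue_delayed_work(dctl->wq, &dctl->work, dctl->delay);
  }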

Preliminary results seem promising: when a lot of freeing is going on,
the discarding is delayed, allowing for reuse, which translates to less
discarding (in addition to the slower discarding). This has shown a
reduction in p90 and p99 read latencies in a test on our webservers.

I am currently working on tuning the rate at which it discards in the
background. I am doing this by evaluating other workloads and drives.
The iops and bps rate limits are fairly aggressive right now as my
basic survey of a few drives noted that the trim command itself is a
significant part of the overhead. So optimizing for larger trims is the
right thing to do.
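
To give a feel for the mechanism being tuned, the delay calculation is
roughly of the following shape; the constants and the helper name are
made up for illustration and are not the values used in this series.

  /* Illustrative constants, not the values used in this series. */
  #define DISCARD_TARGET_MSEC   (6ULL * 60 * 60 * 1000)  /* drain backlog in ~6h */
  #define DISCARD_MAX_IOPS      10                       /* iops rate limit */

  static unsigned long discard_calc_delay(u64 discardable_extents)
  {
      u64 delay_msec;

      if (!discardable_extents)
          return 0;

      /* Spread the current backlog of extents over the target window... */
      delay_msec = div64_u64(DISCARD_TARGET_MSEC, discardable_extents);

      /* ...but never issue discards faster than the iops limit allows. */
      delay_msec = max_t(u64, delay_msec, MSEC_PER_SEC / DISCARD_MAX_IOPS);

      /*
       * A bps limit composes the same way: after trimming N bytes, the
       * next delay is floored at N / bps.
       */
      return msecs_to_jiffies(delay_msec);
  }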

Persistence isn't supported, so when we mount a filesystem, the block
groups are read in as dirty and background trim begins. This makes async
discard more useful for longer running mount points.
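
In code terms, the mount path amounts to something like the sketch
below; for_each_block_group() and the discard_* names are hypothetical
stand-ins for how the real patches hook into block group loading.

  /*
   * Sketch only: treat every data block group as untrimmed at mount and
   * hand it to the async discard worker. for_each_block_group() stands
   * in for walking the cached block group tree.
   */
  static void discard_queue_block_groups(struct btrfs_fs_info *fs_info,
                                         struct btrfs_discard_ctl *dctl)
  {
      struct btrfs_block_group_cache *block_group;

      for_each_block_group(fs_info, block_group) {
          /* Metadata block groups are skipped entirely. */
          if (!(block_group->flags & BTRFS_BLOCK_GROUP_DATA))
              continue;

          spin_lock(&dctl->lock);
          list_move_tail(&block_group->discard_list, &dctl->discard_list);
          spin_unlock(&dctl->lock);
      }

      /* Kick the background worker sketched earlier. */
      queue_delayed_work(dctl->wq, &dctl->work, dctl->delay);
  }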

This series contains the following 19 patches and is on top of
btrfs-devel#master 0f1a7b3fac05:
  0001-bitmap-genericize-percpu-bitmap-region-iterators.patch
  0002-btrfs-rename-DISCARD-opt-to-DISCARD_SYNC.patch
  0003-btrfs-keep-track-of-which-extents-have-been-discarde.patch
  0004-btrfs-keep-track-of-cleanliness-of-the-bitmap.patch
  0005-btrfs-add-the-beginning-of-async-discard-discard-wor.patch
  0006-btrfs-handle-empty-block_group-removal.patch
  0007-btrfs-discard-one-region-at-a-time-in-async-discard.patch
  0008-btrfs-track-discardable-extents-for-asnyc-discard.patch
  0009-btrfs-keep-track-of-discardable_bytes.patch
  0010-btrfs-calculate-discard-delay-based-on-number-of-ext.patch
  0011-btrfs-add-bps-discard-rate-limit.patch
  0012-btrfs-limit-max-discard-size-for-async-discard.patch
  0013-btrfs-have-multiple-discard-lists.patch
  0014-btrfs-only-keep-track-of-data-extents-for-async-disc.patch
  0015-btrfs-load-block_groups-into-discard_list-on-mount.patch
  0016-btrfs-keep-track-of-discard-reuse-stats.patch
  0017-btrfs-add-async-discard-header.patch
  0018-btrfs-increase-the-metadata-allowance-for-the-free_s.patch
  0019-btrfs-make-smaller-extents-more-likely-to-go-into-bi.patch

0001 exports percpu's bitmap iterators for eventual use in 0015. 0002
renames DISCARD to DISCARD_SYNC. 0003 and 0004 add discard tracking to
the free space cache. 0005-0007 add the core of async discard support.
0008-0013 fiddle with stats and operation limits. 0014 makes async
discard only track data block groups. 0015 loads block groups into the
discard lists as the block groups are read in. 0016 adds reuse stats.
0017 adds an explanation header. 0018 and 0019 modify the free space
cache metadata allowance, add a bitmap -> extent path, and make us more
likely to put smaller extents into the bitmaps.

diffstats below:

Dennis Zhou (19):
  bitmap: genericize percpu bitmap region iterators
  btrfs: rename DISCARD opt to DISCARD_SYNC
  btrfs: keep track of which extents have been discarded
  btrfs: keep track of cleanliness of the bitmap
  btrfs: add the beginning of async discard, discard workqueue
  btrfs: handle empty block_group removal
  btrfs: discard one region at a time in async discard
  btrfs: track discardable extents for asnyc discard
  btrfs: keep track of discardable_bytes
  btrfs: calculate discard delay based on number of extents
  btrfs: add bps discard rate limit
  btrfs: limit max discard size for async discard
  btrfs: have multiple discard lists
  btrfs: only keep track of data extents for async discard
  btrfs: load block_groups into discard_list on mount
  btrfs: keep track of discard reuse stats
  btrfs: add async discard header
  btrfs: increase the metadata allowance for the free_space_cache
  btrfs: make smaller extents more likely to go into bitmaps

 fs/btrfs/Makefile           |   2 +-
 fs/btrfs/block-group.c      |  47 +++-
 fs/btrfs/block-group.h      |  18 ++
 fs/btrfs/ctree.h            |  29 ++-
 fs/btrfs/discard.c          | 458 +++++++++++++++++++++++++++++++++
 fs/btrfs/discard.h          | 112 ++++++++
 fs/btrfs/disk-io.c          |  15 +-
 fs/btrfs/extent-tree.c      |  26 +-
 fs/btrfs/free-space-cache.c | 494 ++++++++++++++++++++++++++++++------
 fs/btrfs/free-space-cache.h |  27 +-
 fs/btrfs/inode-map.c        |  13 +-
 fs/btrfs/scrub.c            |   7 +-
 fs/btrfs/super.c            |  39 ++-
 fs/btrfs/sysfs.c            | 141 ++++++++++
 include/linux/bitmap.h      |  35 +++
 mm/percpu.c                 |  61 ++---
 16 files changed, 1369 insertions(+), 155 deletions(-)
 create mode 100644 fs/btrfs/discard.c
 create mode 100644 fs/btrfs/discard.h

Thanks,
Dennis

Comments

Nikolay Borisov Oct. 11, 2019, 7:49 a.m. UTC | #1
On 7.10.19 г. 23:17 ч., Dennis Zhou wrote:
> Hello,
> 

<snip>

> 
> With async discard, we try to emphasize discarding larger regions
> and reusing the LBA (implicit discard). The first is done by using the
> free space cache to maintain discard state, which lets us get
> coalescing fairly cheaply. A background workqueue is used to scan over
> an LRU-kept list of the block groups. It then uses filters to determine
> what to discard next, giving priority to larger discards. While
> reusing an LBA isn't explicitly attempted, it happens implicitly via
> find_free_extent(): if it happens to find a dirty extent, we get to
> reuse the LBA. Additionally, async discard skips metadata

By 'dirty' I assume you mean a not-yet-discarded but free extent?

> block groups, as these should see fairly high turnover since btrfs is a
> self-packing filesystem that is stingy about allocating new block groups
> until necessary.
> 
> Preliminary results seem promising: when a lot of freeing is going on,
> the discarding is delayed, allowing for reuse, which translates to less
> discarding (in addition to the slower discarding). This has shown a
> reduction in p90 and p99 read latencies in a test on our webservers.
> 
> I am currently working on tuning the rate at which it discards in the
> background. I am doing this by evaluating other workloads and drives.
> The iops and bps rate limits are fairly aggressive right now as my
> basic survey of a few drives noted that the trim command itself is a
> significant part of the overhead. So optimizing for larger trims is the
> right thing to do.

Do you intend to share performance results alongside the workloads
used to obtain them? Since this is, at its core, a performance
improvement patch, that is of prime importance!

> 

<snip>
> 
> Thanks,
> Dennis
>
Dennis Zhou Oct. 14, 2019, 9:05 p.m. UTC | #2
On Fri, Oct 11, 2019 at 10:49:20AM +0300, Nikolay Borisov wrote:
> 
> 
> On 7.10.19 г. 23:17 ч., Dennis Zhou wrote:
> > Hello,
> > 
> 
> <snip>
> 
> > 
> > With async discard, we try to emphasize discarding larger regions
> > and reusing the LBA (implicit discard). The first is done by using the
> > free space cache to maintain discard state, which lets us get
> > coalescing fairly cheaply. A background workqueue is used to scan over
> > an LRU-kept list of the block groups. It then uses filters to determine
> > what to discard next, giving priority to larger discards. While
> > reusing an LBA isn't explicitly attempted, it happens implicitly via
> > find_free_extent(): if it happens to find a dirty extent, we get to
> > reuse the LBA. Additionally, async discard skips metadata
> 
> By 'dirty' I assume you mean a not-yet-discarded but free extent?
> 

Yes.

> > block groups, as these should see fairly high turnover since btrfs is a
> > self-packing filesystem that is stingy about allocating new block groups
> > until necessary.
> > 
> > Preliminary results seem promising: when a lot of freeing is going on,
> > the discarding is delayed, allowing for reuse, which translates to less
> > discarding (in addition to the slower discarding). This has shown a
> > reduction in p90 and p99 read latencies in a test on our webservers.
> > 
> > I am currently working on tuning the rate at which it discards in the
> > background. I am doing this by evaluating other workloads and drives.
> > The iops and bps rate limits are fairly aggressive right now as my
> > basic survey of a few drives noted that the trim command itself is a
> > significant part of the overhead. So optimizing for larger trims is the
> > right thing to do.
> 
> Do you intend to share performance results alongside the workloads
> used to obtain them? Since this is, at its core, a performance
> improvement patch, that is of prime importance!
> 

I'll try to find some stuff to share for v2. As I'm just running this
on production machines, I don't intend to share any workloads. However,
there is an iocost workload that demonstrates the problem nicely and
might already be shared.

The win really is moving the work from transaction commit to fully
background work, effectively making discard a second-class citizen. On
more loaded machines, it's not great that discards are blocking
transaction commit. The other thing is that it's very drive-dependent.
Some drives just have really bad discard implementations, so there will
be a bigger win there than on, say, a high-end NVMe drive.

> > 
> 
> <snip>
> > 
> > Thanks,
> > Dennis
> >
David Sterba Oct. 15, 2019, 12:08 p.m. UTC | #3
Hi,

thanks for working on this. The plain -odiscard hasn't been recommended
to users for a long time, even with the SATA 3.1 drives that allow
queueing the requests.

The overall approach to async discard sounds good, the hard part is to
not shoot down the filesystem by trimming live data, and we had a bug
like that in the past. For correctness reasons I understand the size of
the patchset.

On Mon, Oct 07, 2019 at 04:17:31PM -0400, Dennis Zhou wrote:
> I am currently working on tuning the rate at which it discards in the
> background. I am doing this by evaluating other workloads and drives.
> The iops and bps rate limits are fairly aggressive right now as my
> basic survey of a few drives noted that the trim command itself is a
> significant part of the overhead. So optimizing for larger trims is the
> right thing to do.

We need a sane default behaviour, without the need for knobs and
configuration, so it's great that you have a wide range of samples to
tune it.

As trim is only a hint, a short delay in processing the requests or
slight ineffectiveness should be acceptable, to avoid complications in
the code or interference with other IO.

> Persistence isn't supported, so when we mount a filesystem, the block
> groups are read in as dirty and background trim begins. This makes async
> discard more useful for longer running mount points.

I think this is acceptable.

Regarding the code, I leave comments on the block group and trim
structures to Josef and will focus more on the low-level details,
coding style, and changelogs.
Dennis Zhou Oct. 15, 2019, 3:41 p.m. UTC | #4
On Tue, Oct 15, 2019 at 02:08:31PM +0200, David Sterba wrote:
> Hi,
> 
> thanks for working on this. The plain -odiscard hasn't been recommended
> to users for a long time, even with the SATA 3.1 drives that allow
> queueing the requests.
> 
> The overall approach to async discard sounds good, the hard part is to
> not shoot down the filesystem by trimming live data, and we had a bug
> like that in the past. For correctness reasons I understand the size of
> the patchset.
> 
> On Mon, Oct 07, 2019 at 04:17:31PM -0400, Dennis Zhou wrote:
> > I am currently working on tuning the rate at which it discards in the
> > background. I am doing this by evaluating other workloads and drives.
> > The iops and bps rate limits are fairly aggressive right now as my
> > basic survey of a few drives noted that the trim command itself is a
> > significant part of the overhead. So optimizing for larger trims is the
> > right thing to do.
> 
> We need a sane default behaviour, without the need for knobs and
> configuration, so it's great that you have a wide range of samples to
> tune it.
> 

Yeah, I'm just not quite sure what that is yet. The tricky part is that
we don't really get burned by trims until well after we've issued them.
The feedback loop isn't as transparent as it is with reads and
writes.

> As trim is only a hint, a short delay in processing the requests or
> slight ineffectiveness should be acceptable, to avoid complications in
> the code or interference with other IO.
> 
> > Persistence isn't supported, so when we mount a filesystem, the block
> > groups are read in as dirty and background trim begins. This makes async
> > discard more useful for longer running mount points.
> 
> I think this is acceptable.
> 
> Regarding the code, I leave comments on the block group and trim
> structures to Josef and will focus more on the low-level details,
> coding style, and changelogs.

Sounds good. I'll hopefully post a v2 shortly so that we can get more of
that out of the way. Then hopefully a smaller incremental series later
to figure out the configuration part.

Thanks,
Dennis