Message ID | cover.1570479299.git.dennis@kernel.org |
---|---|
Series | btrfs: async discard support |
On 7.10.19 at 23:17, Dennis Zhou wrote:
> Hello,
>
> <snip>
>
> With async discard, we try to emphasize discarding larger regions
> and reusing the lba (implicit discard). The first is done by using the
> free space cache to maintain discard state and thus allows us to get
> coalescing for fairly cheap. A background workqueue is used to scan over
> an LRU-kept list of the block groups. It then uses filters to determine
> what to discard next, hence giving priority to larger discards. While
> reusing an lba isn't explicitly attempted, it happens implicitly via
> find_free_extent() which, if it happens to find a dirty extent, will
> grant us reuse of the lba. Additionally, async discarding skips metadata

By 'dirty' I assume you mean a not-discarded-yet-but-free extent?

> block groups as these should see a fairly high turnover, as btrfs is a
> self-packing filesystem being stingy with allocating new block groups
> until necessary.
>
> Preliminary results seem promising: when a lot of freeing is going on,
> the discarding is delayed, allowing for reuse, which translates to less
> discarding (in addition to the slower discarding). This has shown a
> reduction in p90 and p99 read latencies in a test on our webservers.
>
> I am currently working on tuning the rate at which it discards in the
> background. I am doing this by evaluating other workloads and drives.
> The iops and bps rate limits are fairly aggressive right now, as my
> basic survey of a few drives noted that the trim command itself is a
> significant part of the overhead. So optimizing for larger trims is the
> right thing to do.

Do you intend on sharing performance results alongside the workloads
used to obtain them? Since this is a performance improvement patch at
its core, that is of prime importance!

> <snip>
>
> Thanks,
> Dennis
>
On Fri, Oct 11, 2019 at 10:49:20AM +0300, Nikolay Borisov wrote:
>
> On 7.10.19 at 23:17, Dennis Zhou wrote:
> > Hello,
> >
> > <snip>
> >
> > With async discard, we try to emphasize discarding larger regions
> > and reusing the lba (implicit discard). The first is done by using the
> > free space cache to maintain discard state and thus allows us to get
> > coalescing for fairly cheap. A background workqueue is used to scan over
> > an LRU-kept list of the block groups. It then uses filters to determine
> > what to discard next, hence giving priority to larger discards. While
> > reusing an lba isn't explicitly attempted, it happens implicitly via
> > find_free_extent() which, if it happens to find a dirty extent, will
> > grant us reuse of the lba. Additionally, async discarding skips metadata
>
> By 'dirty' I assume you mean a not-discarded-yet-but-free extent?
>

Yes.

> > block groups as these should see a fairly high turnover, as btrfs is a
> > self-packing filesystem being stingy with allocating new block groups
> > until necessary.
> >
> > Preliminary results seem promising: when a lot of freeing is going on,
> > the discarding is delayed, allowing for reuse, which translates to less
> > discarding (in addition to the slower discarding). This has shown a
> > reduction in p90 and p99 read latencies in a test on our webservers.
> >
> > I am currently working on tuning the rate at which it discards in the
> > background. I am doing this by evaluating other workloads and drives.
> > The iops and bps rate limits are fairly aggressive right now, as my
> > basic survey of a few drives noted that the trim command itself is a
> > significant part of the overhead. So optimizing for larger trims is the
> > right thing to do.
>
> Do you intend on sharing performance results alongside the workloads
> used to obtain them? Since this is a performance improvement patch at
> its core, that is of prime importance!
>

I'll try and find some stuff to share for v2. As I'm just running this
on production machines, I don't intend to share any workloads. However,
there is an iocost workload that demonstrates the problem nicely and
might already be shared.

The win really is moving the work from transaction commit to completely
background work, effectively making discard a 2nd class citizen. On
more loaded machines, it's not great that discards are blocking
transaction commit.

The other thing is that it's very drive dependent. Some drives just
have really bad discard implementations, and there will be a bigger win
there than, say, on some high-end nvme drive.

> >
> > <snip>
> >
> > Thanks,
> > Dennis
> >
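For readers following the design discussion above, here is a rough userspace C sketch of the core idea being described: block groups sit on an LRU, a background pass walks them, skips metadata block groups, and applies a size filter so that only larger untrimmed free ranges get discarded, while smaller "dirty" ranges are left in place so the allocator has a chance to reuse them. This is not btrfs code and the names here (`free_range`, `discard_workfn`, `DISCARD_MIN_LEN`, and so on) are invented for illustration only; in the actual patchset this state lives in the free space cache.

```c
/*
 * Illustrative userspace sketch of the async discard scheme discussed
 * above -- NOT btrfs code.  Types and names are invented for clarity.
 */
#include <stdbool.h>
#include <stdio.h>

struct free_range {
	unsigned long long start;
	unsigned long long len;
	bool trimmed;                  /* false == "dirty": free but not yet discarded */
	struct free_range *next;
};

struct block_group {
	const char *name;
	bool is_metadata;              /* metadata block groups are skipped entirely */
	struct free_range *free_ranges;
	struct block_group *lru_next;  /* LRU of block groups with pending work */
};

/* Size filter: only bother discarding reasonably large regions. */
#define DISCARD_MIN_LEN (64ULL * 1024)

/* Stand-in for issuing an actual trim for one range. */
static void issue_discard(struct block_group *bg, struct free_range *r)
{
	printf("%s: discard [%llu, +%llu)\n", bg->name, r->start, r->len);
	r->trimmed = true;
}

/*
 * One background pass: walk the LRU and, in each data block group, discard
 * the largest still-dirty range that passes the size filter.  Small dirty
 * ranges are left alone in the hope that the allocator reuses them, making
 * the discard unnecessary (the "implicit discard" mentioned above).
 */
static void discard_workfn(struct block_group *lru)
{
	for (struct block_group *bg = lru; bg; bg = bg->lru_next) {
		struct free_range *best = NULL;

		if (bg->is_metadata)
			continue;

		for (struct free_range *r = bg->free_ranges; r; r = r->next) {
			if (r->trimmed || r->len < DISCARD_MIN_LEN)
				continue;
			if (!best || r->len > best->len)
				best = r;
		}
		if (best)
			issue_discard(bg, best);
	}
}

int main(void)
{
	struct free_range big   = { 4096, 128 * 1024, false, NULL };
	struct free_range small = { 1 << 20, 16 * 1024, false, &big };
	struct block_group data = { "data-bg", false, &small, NULL };

	discard_workfn(&data);  /* trims only "big"; "small" stays dirty for reuse */
	return 0;
}
```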
Hi,

thanks for working on this. The plain -odiscard hasn't been recommended
to users for a long time, even with the SATA 3.1 drives that allow
queueing the requests.

The overall approach to async discard sounds good; the hard part is not
shooting down the filesystem by trimming live data, and we had a bug
like that in the past. For correctness reasons I understand the size of
the patchset.

On Mon, Oct 07, 2019 at 04:17:31PM -0400, Dennis Zhou wrote:
> I am currently working on tuning the rate at which it discards in the
> background. I am doing this by evaluating other workloads and drives.
> The iops and bps rate limits are fairly aggressive right now, as my
> basic survey of a few drives noted that the trim command itself is a
> significant part of the overhead. So optimizing for larger trims is the
> right thing to do.

We need a sane default behaviour, without the need for knobs and
configuration, so it's great you can have a wide range of samples to
tune it.

As trim is only a hint, a short delay in processing the requests or
slight ineffectiveness should be acceptable, to avoid complications in
the code or interfering with other IO.

> Persistence isn't supported, so when we mount a filesystem, the block
> groups are read in as dirty and background trim begins. This makes async
> discard more useful for longer-running mount points.

I think this is acceptable.

Regarding the code, I leave comments on the block group and trim
structures to Josef and will focus more on the low-level details,
coding style, and changelogs.
On Tue, Oct 15, 2019 at 02:08:31PM +0200, David Sterba wrote:
> Hi,
>
> thanks for working on this. The plain -odiscard hasn't been recommended
> to users for a long time, even with the SATA 3.1 drives that allow
> queueing the requests.
>
> The overall approach to async discard sounds good; the hard part is not
> shooting down the filesystem by trimming live data, and we had a bug
> like that in the past. For correctness reasons I understand the size of
> the patchset.
>
> On Mon, Oct 07, 2019 at 04:17:31PM -0400, Dennis Zhou wrote:
> > I am currently working on tuning the rate at which it discards in the
> > background. I am doing this by evaluating other workloads and drives.
> > The iops and bps rate limits are fairly aggressive right now, as my
> > basic survey of a few drives noted that the trim command itself is a
> > significant part of the overhead. So optimizing for larger trims is the
> > right thing to do.
>
> We need a sane default behaviour, without the need for knobs and
> configuration, so it's great you can have a wide range of samples to
> tune it.
>

Yeah, I'm just not quite sure what that is yet. The tricky part is that
we don't really get burned by trims until well after we've issued them.
The feedback loop isn't as transparent as it is with reads and writes.

> As trim is only a hint, a short delay in processing the requests or
> slight ineffectiveness should be acceptable, to avoid complications in
> the code or interfering with other IO.
>
> > Persistence isn't supported, so when we mount a filesystem, the block
> > groups are read in as dirty and background trim begins. This makes async
> > discard more useful for longer-running mount points.
>
> I think this is acceptable.
>
> Regarding the code, I leave comments on the block group and trim
> structures to Josef and will focus more on the low-level details,
> coding style, and changelogs.

Sounds good. I'll hopefully post a v2 shortly so that we can get more
of that out of the way. Then hopefully a smaller incremental later to
figure out the configuration part.

Thanks,
Dennis
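To make the iops/bps discussion above concrete, here is a hedged sketch of how a background worker might turn those two caps into a delay between trim requests: each issued discard is charged against both an IO-count budget and a byte budget, and the worker sleeps for whichever delay is longer before issuing the next one. Again, the names and the specific limits are invented for this illustration and are not taken from the patchset.

```c
/*
 * Illustrative sketch of turning iops/bps limits into an inter-discard
 * delay -- invented for this discussion, not code from the patchset.
 */
#include <stdio.h>

/* Example limits, in the spirit of the "fairly aggressive" defaults above. */
#define DISCARD_MAX_IOPS 10ULL            /* trims per second */
#define DISCARD_MAX_BPS  (100ULL << 20)   /* bytes per second */

/*
 * Delay (in microseconds) to wait after issuing one discard of 'bytes',
 * so that neither the iops cap nor the bps cap is exceeded on average.
 */
static unsigned long long discard_delay_us(unsigned long long bytes)
{
	unsigned long long iops_delay = 1000000ULL / DISCARD_MAX_IOPS;
	unsigned long long bps_delay  = bytes * 1000000ULL / DISCARD_MAX_BPS;

	return iops_delay > bps_delay ? iops_delay : bps_delay;
}

int main(void)
{
	/* A large trim is paced by the bps cap, a small one by the iops cap. */
	printf("1 GiB trim -> sleep %llu us\n", discard_delay_us(1ULL << 30));
	printf("64 KiB trim -> sleep %llu us\n", discard_delay_us(64ULL << 10));
	return 0;
}
```

Pacing by the larger of the two delays bounds bursts of small trims via the iops cap while stretching out huge trims via the bps cap, which fits the observation above that the per-command trim overhead dominates and fewer, larger trims are preferable.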