[RFC,0/9] btrfs: simple quotas

Message ID: cover.1683075170.git.boris@bur.io

Message

Boris Burkov May 3, 2023, 12:59 a.m. UTC
btrfs quota groups (qgroups) are a compelling feature of btrfs that
allow flexible control for limiting subvolume data and metadata usage.
However, due to btrfs's high-level decision to trade off ref-counting
performance in favor of snapshot performance, qgroups suffer from
non-trivial performance issues that make them unattractive in certain
workloads. In particular, frequent backref walking during writes and
during commits can make operations increasingly expensive as the number
of snapshots scales up. For that reason, we have never been able to
commit to using qgroups in production at Meta, despite significant
interest from people running container workloads, where we would benefit
from protecting the rest of the host from a buggy application in a
container running away with disk usage.

This patch series introduces a simplified version of qgroups called
simple quotas (squotas), which never computes global reference counts
for extents and thus has performance characteristics similar to normal
btrfs with quotas disabled. The "trick" is that in simple quotas mode, we
account all extents permanently to the subvolume in which they were
originally created. That allows us to make all accounting 1:1 with
extent item lifetime, removing the need to walk backrefs. However, this
sacrifices the ability to compute shared vs. exclusive usage. It also
results in counter-intuitive, though still predictable and simple,
accounting in the cases where an original extent is removed while a
shared copy still exists. Qgroups can detect that case and count the
remaining copy as exclusive usage of its surviving owner, while squotas
cannot. As a result, squotas works best when the original extent is
immutable and outlives any clones.
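
To make that last case concrete, here is a rough CLI sketch (the device,
mount point and sizes are placeholders, and the exact qgroup report
formatting may differ):

  mkfs.btrfs -f /dev/vdb
  mount /dev/vdb /mnt
  btrfs quota enable --simple /mnt
  btrfs subvolume create /mnt/a
  btrfs subvolume create /mnt/b
  # the extent is created in subvol "a", so squotas charge it to a's qgroup
  dd if=/dev/zero of=/mnt/a/f bs=1M count=128
  # share the same extent into subvol "b" via reflink
  cp --reflink=always /mnt/a/f /mnt/b/f
  # remove the original; the extent is still alive through b's copy, so
  # with squotas it stays accounted to a until the last reference goes away
  rm /mnt/a/f
  sync
  btrfs qgroup show /mnt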

In order to track the original creating subvolume of a data extent in
the face of reflinks, it is necessary to add additional accounting to
the extent item. To save space, this is done with a new inline ref item.
However, the downside of this approach is that it makes enabling squota
an incompat change, denoted by the new incompat bit SIMPLE_QUOTA. When
this bit is set and quotas are enabled, new extent items get the extra
accounting, and freed extent items check for the accounting to find
their creating subvolume.
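
(Continuing the sketch above: once quotas are enabled in simple mode, the
new bit is visible from userspace via dump-super, though whether it
decodes to a named flag depends on progs support, e.g. the squota-progs
branch below.)

  sync
  # SIMPLE_QUOTA should appear among the superblock incompat flags
  btrfs inspect-internal dump-super /dev/vdb | grep -i incompat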

Squotas reuses the API of qgroups. The only difference is that when you
enable quotas via `btrfs quota enable`, you pass the `--simple` flag.
Squotas will always report exclusive == shared for each qgroup.
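
So day-to-day usage looks just like qgroups. A minimal sketch (the paths
and limit value are arbitrary):

  btrfs quota enable --simple /mnt
  btrfs subvolume create /mnt/ctr
  # the usual qgroup commands work unchanged
  btrfs qgroup limit 10G /mnt/ctr
  btrfs qgroup show -re /mnt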

This is still a preliminary RFC patch series, so not all the ducks are
fully in a row. In particular, some userspace parts are missing, like
meaningful integration with fsck, which will drive further testing.

My current branches for btrfs-progs and fstests do contain some (sloppy)
minimal support needed to run and test the feature:
btrfs-progs: https://github.com/boryas/btrfs-progs/tree/squota-progs
fstests: https://github.com/boryas/fstests/tree/squota-test

Current testing methodology:
- New fstest (btrfs/400 in the squota-test branch)
- Run all fstests with squota enabled at mkfs time. Not all tests are
  passing in this regime, though this is actually true of qgroups as
  well. Most of the issues have to do with leaking reserved space in
  less commonly tested cases like I/O failures. My intent is to get this
  test suite fully passing.
- Run all fstests without squota enabled at mkfs time

Basic performance test:
In this test, I ran a workload which generated K files in a subvolume,
then took L snapshots of that subvolume, then unshared each file in
each subvolume. The measurement is just total walltime. K is the row
index and L the column index, so in these tables, we vary between 1
and 100 files and 1 and 10000 snapshots. The "n" table is no quotas,
the "q" table is qgroups, and the "s" table is squotas. As you can see,
"n" and "s" are quite similar, while "q" falls off a cliff as the
number of snapshots increases. More sophisticated and realistic
performance testing that doesn't abuse such an insane number of
snapshots is still to come.

n (no quotas; rows = K files, columns = L snapshots)
        1       10      100     1000    10000
1       0.18    0.24    1.58    16.49   211.34
10      0.28    0.43    2.80    29.74   324.70
100     0.55    0.99    6.57    65.13   717.51

q (qgroups)
        1       10      100     1000    10000
1       2.19    0.35    2.32    25.78   756.62
10      0.34    0.48    3.24    68.72   3731.73
100     0.64    0.80    7.63    215.54  14170.73

s (squotas)
        1       10      100     1000    10000
1       2.19    0.32    1.83    19.19   231.75
10      0.31    0.43    2.86    28.86   351.42
100     0.70    0.90    6.75    67.89   742.93


Boris Burkov (9):
  btrfs: simple quotas mode
  btrfs: new function for recording simple quota usage
  btrfs: track original extent subvol in a new inline ref
  btrfs: track metadata owning root in delayed refs
  btrfs: record simple quota deltas
  btrfs: auto hierarchy for simple qgroups of nested subvols
  btrfs: check generation when recording simple quota delta
  btrfs: expose the qgroup mode via sysfs
  btrfs: free qgroup rsv on io failure

 fs/btrfs/accessors.h            |   6 +
 fs/btrfs/backref.c              |   3 +
 fs/btrfs/delayed-ref.c          |  13 +-
 fs/btrfs/delayed-ref.h          |  28 ++++-
 fs/btrfs/extent-tree.c          | 143 +++++++++++++++++----
 fs/btrfs/fs.h                   |   7 +-
 fs/btrfs/ioctl.c                |   4 +-
 fs/btrfs/ordered-data.c         |   6 +-
 fs/btrfs/print-tree.c           |  12 ++
 fs/btrfs/qgroup.c               | 216 +++++++++++++++++++++++++++++---
 fs/btrfs/qgroup.h               |  29 ++++-
 fs/btrfs/ref-verify.c           |   3 +
 fs/btrfs/relocation.c           |  11 +-
 fs/btrfs/sysfs.c                |  26 ++++
 fs/btrfs/transaction.c          |  11 +-
 fs/btrfs/tree-checker.c         |   3 +
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |  13 ++
 18 files changed, 471 insertions(+), 64 deletions(-)

Comments

Anand Jain May 5, 2023, 4:13 a.m. UTC | #1
On 3/5/23 08:59, Boris Burkov wrote:
> [...]
> Basic performance test:

> In this test, I ran a workload which generated K files in a subvolume,
> then took L snapshots of that subvolume, then unshared each file in
> each subvolume.

Can you pls provide a link to the test script?
I couldn't find it in the links mentioned above.

Thanks, Anand

Boris Burkov May 10, 2023, 1:09 a.m. UTC | #2
On Fri, May 05, 2023 at 12:13:58PM +0800, Anand Jain wrote:
> On 3/5/23 08:59, Boris Burkov wrote:
> > [...]
> > Basic performance test:
> 
> > In this test, I ran a workload which generated K files in a subvolume,
> > then took L snapshots of that subvolume, then unshared each file in
> > each subvolume.
> 
> Can you pls provide a link to the test script?
> I couldn't find it in the links mentioned above.
> 
> Thanks, Anand

Hey Anand,

Sorry, I missed this message and didn't reply sooner.

The reason I didn't share the scripts is that they are spread across a
couple of different places and I haven't yet collected them into a
shareable form (probably in fsperf, eventually).

This repo has some scripts for scaling snapshots, extents and files:
https://github.com/boryas/scripts/tree/main/sh/btrfs-scale
(it depends on my silly personal shell script infra to really run:
https://github.com/boryas/clitools). I don't currently have the script
that puts it all together at hand (I forgot to push it and it's on a
different computer, but I'm at LSF/MM/BPF...), but IIRC it does
something like:

mkfs on an nvme; mount it
make a subvol
put K files into the subvol with the files scaling script
make a snapshot dir
make L identical snapshots in that dir (no snapshots of snapshots)
touch 4k of each of the 8k files with dd to unshare
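
Roughly, in shell (the device name, counts and file size below are
placeholders, not the exact values from the runs above):

  dev=/dev/nvme0n1
  K=100    # files
  L=1000   # snapshots
  mkfs.btrfs -f $dev
  mount $dev /mnt
  # add 'btrfs quota enable [--simple] /mnt' here for the q / s runs
  btrfs subvolume create /mnt/subvol
  for i in $(seq $K); do
      dd if=/dev/zero of=/mnt/subvol/f$i bs=8k count=1 status=none
  done
  mkdir /mnt/snaps
  for j in $(seq $L); do
      btrfs subvolume snapshot /mnt/subvol /mnt/snaps/snap$j
  done
  # unshare: overwrite the first 4k of every file in every snapshot
  for f in /mnt/snaps/*/f*; do
      dd if=/dev/zero of="$f" bs=4k count=1 conv=notrunc status=none
  done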

If you'd like to play around with the workload before the next posting
and this isn't enough info, let me know and I'll share more with you
ASAP. I think you should be able to slap something together from this,
though.

TL;DR: it's a dumb script that was just a PoC to convince me the basic
scaling was what I expected, but I promise to share my methodology
properly the next time I send the patch series.

Thanks for your interest!

Boris
