Message ID: 20210110160720.3922965-1-chandanrlinux@gmail.com (mailing list archive)
Series: Bail out if transaction can cause extent count to overflow
On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@gmail.com> wrote:
>
> XFS does not check for possible overflow of per-inode extent counter
> fields when adding extents to either data or attr fork.
>
> For e.g.
> 1. Insert 5 million xattrs (each having a value size of 255 bytes) and
>    then delete 50% of them in an alternating manner.
>
> 2. On a 4k block sized XFS filesystem instance, the above causes 98511
>    extents to be created in the attr fork of the inode.
>
>    xfsaild/loop0 2008 [003] 1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131
>
> 3. The incore inode fork extent counter is a signed 32-bit
>    quantity. However, the on-disk extent counter is an unsigned 16-bit
>    quantity and hence cannot hold 98511 extents.
>
> 4. The following incorrect value is stored in the xattr extent counter,
>    # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0
>    core.naextents = -32561
>
> This patchset adds a new helper function
> (i.e. xfs_iext_count_may_overflow()) to check for overflow of the
> per-inode data and xattr extent counters and invokes it before
> starting an fs operation (e.g. creating a new directory entry). With
> this patchset applied, XFS detects counter overflows and returns with
> an error rather than causing a silent corruption.
>
> The patchset has been tested by executing xfstests with the following
> mkfs.xfs options,
> 1. -m crc=0 -b size=1k
> 2. -m crc=0 -b size=4k
> 3. -m crc=0 -b size=512
> 4. -m rmapbt=1,reflink=1 -b size=1k
> 5. -m rmapbt=1,reflink=1 -b size=4k
>
> The patches can also be obtained from
> https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14.
>
> I have two patches that define the newly introduced error injection
> tags in xfsprogs
> (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/).
>
> I have also written tests
> (https://github.com/chandanr/xfstests/commits/extent-overflow-tests)
> for verifying the checks introduced in the kernel.
>

Hi Chandan and XFS folks,

As you may have heard, I am working on producing a series of
xfs patches for stable v5.10.y.

My patch selection is documented at [1].
I am in the process of testing the backport patches against the 5.10.y
baseline using Luis' kdevops [2] fstests runner.

The configurations that we are testing are:
1. -m rmapbt=0,reflink=1 -b size=4k (default)
2. -m crc=0 -b size=4k
3. -m crc=0 -b size=512
4. -m rmapbt=1,reflink=1 -b size=1k
5. -m rmapbt=1,reflink=1 -b size=4k

This patch set is the only largish series that I selected, because:
- It applies cleanly to 5.10.y
- I evaluated it as low risk and high value
- Chandan has written good regression tests

I intend to post the rest of the individual selected patches
for review in small batches after they pass the tests, but w.r.t. this
patch set -

Does anyone object to including it in the stable kernel
after it passes the tests?

Thanks,
Amir.

[1] https://github.com/amir73il/b4/blob/xfs-5.10.y/xfs-5.10..5.17-fixes.rst
[2] https://github.com/linux-kdevops/kdevops
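[The arithmetic behind the -32561 value in the cover letter, and a rough sketch of step 1 of its reproducer, can be illustrated as follows (bash). The mount point, file name and helper names are hypothetical; Chandan's actual tests live in the xfstests branch linked above.]

  # Why 98511 attr-fork extents read back as -32561: the on-disk counter is an
  # unsigned 16-bit field, so only the low 16 bits survive, and xfs_db prints
  # them as a signed value (numbers taken from the cover letter).
  nextents=98511
  low16=$(( nextents & 0xffff ))                       # 32975
  (( low16 > 32767 )) && low16=$(( low16 - 65536 ))    # -> -32561
  echo "$nextents is stored as $low16"

  # Rough sketch of step 1 of the reproducer: create many small xattrs, then
  # delete every other one to fragment the attr fork. A smaller count is
  # enough to watch if_nextents grow on a test filesystem.
  count=5000000                                        # as in the cover letter
  f=/mnt/test/attr-heavy
  touch "$f"
  v=$(head -c 255 /dev/zero | tr '\0' 'a')             # 255-byte xattr value
  for ((i = 1; i <= count; i++)); do
          setfattr -n "user.a$i" -v "$v" "$f"
  done
  for ((i = 1; i <= count; i += 2)); do                # delete in an alternating manner
          setfattr -x "user.a$i" "$f"
  done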
On Mon, May 23, 2022 at 02:15:44 PM +0300, Amir Goldstein wrote: > On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@gmail.com> wrote: >> >> XFS does not check for possible overflow of per-inode extent counter >> fields when adding extents to either data or attr fork. >> >> For e.g. >> 1. Insert 5 million xattrs (each having a value size of 255 bytes) and >> then delete 50% of them in an alternating manner. >> >> 2. On a 4k block sized XFS filesystem instance, the above causes 98511 >> extents to be created in the attr fork of the inode. >> >> xfsaild/loop0 2008 [003] 1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131 >> >> 3. The incore inode fork extent counter is a signed 32-bit >> quantity. However, the on-disk extent counter is an unsigned 16-bit >> quantity and hence cannot hold 98511 extents. >> >> 4. The following incorrect value is stored in the xattr extent counter, >> # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0 >> core.naextents = -32561 >> >> This patchset adds a new helper function >> (i.e. xfs_iext_count_may_overflow()) to check for overflow of the >> per-inode data and xattr extent counters and invokes it before >> starting an fs operation (e.g. creating a new directory entry). With >> this patchset applied, XFS detects counter overflows and returns with >> an error rather than causing a silent corruption. >> >> The patchset has been tested by executing xfstests with the following >> mkfs.xfs options, >> 1. -m crc=0 -b size=1k >> 2. -m crc=0 -b size=4k >> 3. -m crc=0 -b size=512 >> 4. -m rmapbt=1,reflink=1 -b size=1k >> 5. -m rmapbt=1,reflink=1 -b size=4k >> >> The patches can also be obtained from >> https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14. >> >> I have two patches that define the newly introduced error injection >> tags in xfsprogs >> (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/). >> >> I have also written tests >> (https://github.com/chandanr/xfstests/commits/extent-overflow-tests) >> for verifying the checks introduced in the kernel. >> > > Hi Chandan and XFS folks, > > As you may have heard, I am working on producing a series of > xfs patches for stable v5.10.y. > > My patch selection is documented at [1]. > I am in the process of testing the backport patches against the 5.10.y > baseline using Luis' kdevops [2] fstests runner. > > The configurations that we are testing are: > 1. -m rmbat=0,reflink=1 -b size=4k (default) > 2. -m crc=0 -b size=4k > 3. -m crc=0 -b size=512 > 4. -m rmapbt=1,reflink=1 -b size=1k > 5. -m rmapbt=1,reflink=1 -b size=4k > > This patch set is the only largish series that I selected, because: > - It applies cleanly to 5.10.y > - I evaluated it as low risk and high value > - Chandan has written good regression tests > > I intend to post the rest of the individual selected patches > for review in small batches after they pass the tests, but w.r.t this > patch set - > > Does anyone object to including it in the stable kernel > after it passes the tests? > Hi Amir, The following three commits will have to be skipped from the series, 1. 02092a2f034fdeabab524ae39c2de86ba9ffa15a xfs: Check for extent overflow when renaming dir entries 2. 0dbc5cb1a91cc8c44b1c75429f5b9351837114fd xfs: Check for extent overflow when removing dir entries 3. f5d92749191402c50e32ac83dd9da3b910f5680f xfs: Check for extent overflow when adding dir entries The maximum size of a directory data fork is ~96GiB. 
This is much smaller than what can be accommodated by the existing data fork extent counter (i.e. 2^31 extents). Also the corresponding test (i.e. xfs/533) has been removed from fstests. Please refer to https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/?id=9ae10c882550c48868e7c0baff889bb1a7c7c8e9
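[A back-of-envelope calculation makes the point concrete: even at the smallest supported block size of 512 bytes, and assuming the worst case of one extent per block, a ~96GiB directory fork stays well below the 2^31 extent limit.]

  # Worst case: every 512-byte block of a ~96GiB directory fork is its own extent.
  echo $(( 96 * 1024 * 1024 * 1024 / 512 ))   # 201326592 extents
  echo $(( 1 << 31 ))                         # 2147483648, the data fork counter limit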
On Mon, May 23, 2022 at 7:17 PM Chandan Babu R <chandan.babu@oracle.com> wrote: > > On Mon, May 23, 2022 at 02:15:44 PM +0300, Amir Goldstein wrote: > > On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@gmail.com> wrote: > >> > >> XFS does not check for possible overflow of per-inode extent counter > >> fields when adding extents to either data or attr fork. > >> > >> For e.g. > >> 1. Insert 5 million xattrs (each having a value size of 255 bytes) and > >> then delete 50% of them in an alternating manner. > >> > >> 2. On a 4k block sized XFS filesystem instance, the above causes 98511 > >> extents to be created in the attr fork of the inode. > >> > >> xfsaild/loop0 2008 [003] 1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131 > >> > >> 3. The incore inode fork extent counter is a signed 32-bit > >> quantity. However, the on-disk extent counter is an unsigned 16-bit > >> quantity and hence cannot hold 98511 extents. > >> > >> 4. The following incorrect value is stored in the xattr extent counter, > >> # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0 > >> core.naextents = -32561 > >> > >> This patchset adds a new helper function > >> (i.e. xfs_iext_count_may_overflow()) to check for overflow of the > >> per-inode data and xattr extent counters and invokes it before > >> starting an fs operation (e.g. creating a new directory entry). With > >> this patchset applied, XFS detects counter overflows and returns with > >> an error rather than causing a silent corruption. > >> > >> The patchset has been tested by executing xfstests with the following > >> mkfs.xfs options, > >> 1. -m crc=0 -b size=1k > >> 2. -m crc=0 -b size=4k > >> 3. -m crc=0 -b size=512 > >> 4. -m rmapbt=1,reflink=1 -b size=1k > >> 5. -m rmapbt=1,reflink=1 -b size=4k > >> > >> The patches can also be obtained from > >> https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14. > >> > >> I have two patches that define the newly introduced error injection > >> tags in xfsprogs > >> (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/). > >> > >> I have also written tests > >> (https://github.com/chandanr/xfstests/commits/extent-overflow-tests) > >> for verifying the checks introduced in the kernel. > >> > > > > Hi Chandan and XFS folks, > > > > As you may have heard, I am working on producing a series of > > xfs patches for stable v5.10.y. > > > > My patch selection is documented at [1]. > > I am in the process of testing the backport patches against the 5.10.y > > baseline using Luis' kdevops [2] fstests runner. > > > > The configurations that we are testing are: > > 1. -m rmbat=0,reflink=1 -b size=4k (default) > > 2. -m crc=0 -b size=4k > > 3. -m crc=0 -b size=512 > > 4. -m rmapbt=1,reflink=1 -b size=1k > > 5. -m rmapbt=1,reflink=1 -b size=4k > > > > This patch set is the only largish series that I selected, because: > > - It applies cleanly to 5.10.y > > - I evaluated it as low risk and high value > > - Chandan has written good regression tests > > > > I intend to post the rest of the individual selected patches > > for review in small batches after they pass the tests, but w.r.t this > > patch set - > > > > Does anyone object to including it in the stable kernel > > after it passes the tests? > > > > Hi Amir, > > The following three commits will have to be skipped from the series, > > 1. 02092a2f034fdeabab524ae39c2de86ba9ffa15a > xfs: Check for extent overflow when renaming dir entries > > 2. 
0dbc5cb1a91cc8c44b1c75429f5b9351837114fd > xfs: Check for extent overflow when removing dir entries > > 3. f5d92749191402c50e32ac83dd9da3b910f5680f > xfs: Check for extent overflow when adding dir entries > > The maximum size of a directory data fork is ~96GiB. This is much smaller than > what can be accommodated by the existing data fork extent counter (i.e. 2^31 > extents). > Thanks for this information! I understand that the "fixes" are not needed, but the motto of the stable tree maintainers is that taking harmless patches is preferred over non-clean backports, and without those patches the rest of the series does not apply cleanly. So the question is: does it hurt to take those patches to the stable tree? > Also the corresponding test (i.e. xfs/533) has been removed from fstests. Please refer to https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/?id=9ae10c882550c48868e7c0baff889bb1a7c7c8e9 > Well, the test does not fail, so it doesn't hurt either. Right? In my test env, we will occasionally pull the latest fstests and then the unneeded test will be removed. Does that sound right? Thanks, Amir.
On Mon, May 23, 2022 at 02:15:44PM +0300, Amir Goldstein wrote: > On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@gmail.com> wrote: > > > > XFS does not check for possible overflow of per-inode extent counter > > fields when adding extents to either data or attr fork. > > > > For e.g. > > 1. Insert 5 million xattrs (each having a value size of 255 bytes) and > > then delete 50% of them in an alternating manner. > > > > 2. On a 4k block sized XFS filesystem instance, the above causes 98511 > > extents to be created in the attr fork of the inode. > > > > xfsaild/loop0 2008 [003] 1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131 > > > > 3. The incore inode fork extent counter is a signed 32-bit > > quantity. However, the on-disk extent counter is an unsigned 16-bit > > quantity and hence cannot hold 98511 extents. > > > > 4. The following incorrect value is stored in the xattr extent counter, > > # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0 > > core.naextents = -32561 > > > > This patchset adds a new helper function > > (i.e. xfs_iext_count_may_overflow()) to check for overflow of the > > per-inode data and xattr extent counters and invokes it before > > starting an fs operation (e.g. creating a new directory entry). With > > this patchset applied, XFS detects counter overflows and returns with > > an error rather than causing a silent corruption. > > > > The patchset has been tested by executing xfstests with the following > > mkfs.xfs options, > > 1. -m crc=0 -b size=1k > > 2. -m crc=0 -b size=4k > > 3. -m crc=0 -b size=512 > > 4. -m rmapbt=1,reflink=1 -b size=1k > > 5. -m rmapbt=1,reflink=1 -b size=4k > > > > The patches can also be obtained from > > https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14. > > > > I have two patches that define the newly introduced error injection > > tags in xfsprogs > > (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/). > > > > I have also written tests > > (https://github.com/chandanr/xfstests/commits/extent-overflow-tests) > > for verifying the checks introduced in the kernel. > > > > Hi Chandan and XFS folks, > > As you may have heard, I am working on producing a series of > xfs patches for stable v5.10.y. > > My patch selection is documented at [1]. > I am in the process of testing the backport patches against the 5.10.y > baseline using Luis' kdevops [2] fstests runner. > > The configurations that we are testing are: > 1. -m rmbat=0,reflink=1 -b size=4k (default) > 2. -m crc=0 -b size=4k > 3. -m crc=0 -b size=512 > 4. -m rmapbt=1,reflink=1 -b size=1k > 5. -m rmapbt=1,reflink=1 -b size=4k > > This patch set is the only largish series that I selected, because: > - It applies cleanly to 5.10.y > - I evaluated it as low risk and high value What value does it provide LTS users? This series adds almost no value to normal users - extent count overflows are just something that doesn't happen in production systems at this point in time. The largest data extent count I've ever seen is still an order of magnitude of extents away from overflowing (i.e. 400 million extents seen, 4 billion to overflow), and nobody is using the attribute fork sufficiently hard to overflow 65536 extents (typically a couple of million xattrs per inode). i.e. 
this series is ground work for upcoming internal filesystem functionality that require much larger attribute forks (parent pointers and fsverity merkle tree storage) to be supported, and allow scope for much larger, massively fragmented VM image files (beyond 16TB on 4kB block size fs for worst case fragmentation/reflink). As a standalone patchset, this provides almost no real benefit to users but adds a whole new set of "hard stop" error paths across every operation that does inode data/attr extent allocation. i.e. the scope of affected functionality is very wide, the benefit to users is pretty much zero. Hence I'm left wondering what criteria ranks this as a high value change... > - Chandan has written good regression tests > > I intend to post the rest of the individual selected patches > for review in small batches after they pass the tests, but w.r.t this > patch set - > > Does anyone object to including it in the stable kernel > after it passes the tests? I prefer that the process doesn't result in taking random unnecesary functionality into stable kernels. The part of the LTS process that I've most disagreed with is the "backport random unnecessary changes" part of the stable selection criteria. It doesn't matter if it's selected by a bot or a human, the problems that causes are the same. Hence on those grounds, I'd say this isn't a stable backport candidate at all... Cheers, Dave.
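[The 16TB figure ties back to the "4 billion to overflow" number above: with 4kB blocks and worst-case one-block extents, that many extents cover roughly 16TiB.]

  # ~4 billion (2^32) single-block extents of 4KiB each, expressed in TiB.
  echo $(( (1 << 32) * 4096 / (1 << 40) ))    # 16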
On Tue, May 24, 2022 at 1:43 AM Dave Chinner <david@fromorbit.com> wrote: > > On Mon, May 23, 2022 at 02:15:44PM +0300, Amir Goldstein wrote: > > On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@gmail.com> wrote: > > > > > > XFS does not check for possible overflow of per-inode extent counter > > > fields when adding extents to either data or attr fork. > > > > > > For e.g. > > > 1. Insert 5 million xattrs (each having a value size of 255 bytes) and > > > then delete 50% of them in an alternating manner. > > > > > > 2. On a 4k block sized XFS filesystem instance, the above causes 98511 > > > extents to be created in the attr fork of the inode. > > > > > > xfsaild/loop0 2008 [003] 1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131 > > > > > > 3. The incore inode fork extent counter is a signed 32-bit > > > quantity. However, the on-disk extent counter is an unsigned 16-bit > > > quantity and hence cannot hold 98511 extents. > > > > > > 4. The following incorrect value is stored in the xattr extent counter, > > > # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0 > > > core.naextents = -32561 > > > > > > This patchset adds a new helper function > > > (i.e. xfs_iext_count_may_overflow()) to check for overflow of the > > > per-inode data and xattr extent counters and invokes it before > > > starting an fs operation (e.g. creating a new directory entry). With > > > this patchset applied, XFS detects counter overflows and returns with > > > an error rather than causing a silent corruption. > > > > > > The patchset has been tested by executing xfstests with the following > > > mkfs.xfs options, > > > 1. -m crc=0 -b size=1k > > > 2. -m crc=0 -b size=4k > > > 3. -m crc=0 -b size=512 > > > 4. -m rmapbt=1,reflink=1 -b size=1k > > > 5. -m rmapbt=1,reflink=1 -b size=4k > > > > > > The patches can also be obtained from > > > https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14. > > > > > > I have two patches that define the newly introduced error injection > > > tags in xfsprogs > > > (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/). > > > > > > I have also written tests > > > (https://github.com/chandanr/xfstests/commits/extent-overflow-tests) > > > for verifying the checks introduced in the kernel. > > > > > > > Hi Chandan and XFS folks, > > > > As you may have heard, I am working on producing a series of > > xfs patches for stable v5.10.y. > > > > My patch selection is documented at [1]. > > I am in the process of testing the backport patches against the 5.10.y > > baseline using Luis' kdevops [2] fstests runner. > > > > The configurations that we are testing are: > > 1. -m rmbat=0,reflink=1 -b size=4k (default) > > 2. -m crc=0 -b size=4k > > 3. -m crc=0 -b size=512 > > 4. -m rmapbt=1,reflink=1 -b size=1k > > 5. -m rmapbt=1,reflink=1 -b size=4k > > > > This patch set is the only largish series that I selected, because: > > - It applies cleanly to 5.10.y > > - I evaluated it as low risk and high value > > What value does it provide LTS users? > Cloud providers deploy a large number of VMs/containers and they may use reflink. So I think this could be an issue. > This series adds almost no value to normal users - extent count > overflows are just something that doesn't happen in production > systems at this point in time. The largest data extent count I've > ever seen is still an order of magnitude of extents away from > overflowing (i.e. 
400 million extents seen, 4 billion to overflow), > and nobody is using the attribute fork sufficiently hard to overflow > 65536 extents (typically a couple of million xattrs per inode). > > i.e. this series is ground work for upcoming internal filesystem > functionality that require much larger attribute forks (parent > pointers and fsverity merkle tree storage) to be supported, and > allow scope for much larger, massively fragmented VM image files > (beyond 16TB on 4kB block size fs for worst case > fragmentation/reflink). I am not sure I follow this argument. Users can create large attributes, can they not? And users can create massive fragmented/reflinked images, can they not? If we have learned anything, is that if users can do something (i.e. on stable), users will do that, so it may still be worth protecting this workflow? I argue that the reason that you did not see those constructs in the wild yet, is the time it takes until users format new xfs filesystems with mkfs that defaults to reflink enabled and then use latest userspace tools that started to do copy_file_range() or clone on their filesystem, perhaps even without the user's knowledge, such as samba [1]. [1] https://gitlab.com/samba-team/samba/-/merge_requests/2044 > > As a standalone patchset, this provides almost no real benefit to > users but adds a whole new set of "hard stop" error paths across > every operation that does inode data/attr extent allocation. i.e. > the scope of affected functionality is very wide, the benefit > to users is pretty much zero. > > Hence I'm left wondering what criteria ranks this as a high value > change... > Given your inputs, I am not sure that the fix has high value, but I must say I didn't fully understand your argument. It sounded like "We don't need the fix because we did not see the problem yet", but I may have misunderstood you. I am sure that you are aware of the fact that even though 5.10 is almost 2 y/o, it has only been deployed recently by some distros. For example, Amazon AMI [2] and Google Cloud COS [3] images based on the "new" 5.10 kernel were only released about half a year ago. [2] https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-linux-2-ami-kernel-5-10/ [3] https://cloud.google.com/container-optimized-os/docs/release-notes/m93#cos-93-16623-39-6 I have not analysed the distro situation w.r.t xfsprogs, but here the important factor is which version of xfsprogs was used to format the user's filesystem, not which xfsprogs is installed on their system now. > > - Chandan has written good regression tests > > > > I intend to post the rest of the individual selected patches > > for review in small batches after they pass the tests, but w.r.t this > > patch set - > > > > Does anyone object to including it in the stable kernel > > after it passes the tests? > > I prefer that the process doesn't result in taking random unnecesary > functionality into stable kernels. The part of the LTS process that > I've most disagreed with is the "backport random unnecessary > changes" part of the stable selection criteria. It doesn't matter if > it's selected by a bot or a human, the problems that causes are the > same. I am in agreement with you. If you actually look at my selections [4] I think that you will find that they are very far from "random". I have tried to make it VERY easy to review my selections, by listing the links to lore instead of the commit ids and my selection process is also documented in the git log. 
TBH, *this* series was the one that I was mostly in doubt about, which is one of the reasons I posted it first to the list. I was pretty confident about my risk estimation, but not so much about the value. Also, I am considering my post in this mailing list (without CC stable) part of the process, and the inputs I got from you and from Chandan is exactly what is missing in the regular stable tree process IMO, so I appreciate your inputs very much. > > Hence on those grounds, I'd say this isn't a stable backport > candidate at all... > If my arguments did not convince you, out goes this series! I shall be posting more patches for consideration in the coming weeks. I would appreciate your inputs on those as well. You guys are welcome to review my selection [4] already. Thanks! Amir. [4] https://github.com/amir73il/b4/blob/xfs-5.10.y/xfs-5.10..5.17-fixes.rst
On Tue, May 24, 2022 at 8:36 AM Amir Goldstein <amir73il@gmail.com> wrote: > > On Tue, May 24, 2022 at 1:43 AM Dave Chinner <david@fromorbit.com> wrote: > > > > On Mon, May 23, 2022 at 02:15:44PM +0300, Amir Goldstein wrote: > > > On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@gmail.com> wrote: > > > > > > > > XFS does not check for possible overflow of per-inode extent counter > > > > fields when adding extents to either data or attr fork. > > > > > > > > For e.g. > > > > 1. Insert 5 million xattrs (each having a value size of 255 bytes) and > > > > then delete 50% of them in an alternating manner. > > > > > > > > 2. On a 4k block sized XFS filesystem instance, the above causes 98511 > > > > extents to be created in the attr fork of the inode. > > > > > > > > xfsaild/loop0 2008 [003] 1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131 > > > > > > > > 3. The incore inode fork extent counter is a signed 32-bit > > > > quantity. However, the on-disk extent counter is an unsigned 16-bit > > > > quantity and hence cannot hold 98511 extents. > > > > > > > > 4. The following incorrect value is stored in the xattr extent counter, > > > > # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0 > > > > core.naextents = -32561 > > > > > > > > This patchset adds a new helper function > > > > (i.e. xfs_iext_count_may_overflow()) to check for overflow of the > > > > per-inode data and xattr extent counters and invokes it before > > > > starting an fs operation (e.g. creating a new directory entry). With > > > > this patchset applied, XFS detects counter overflows and returns with > > > > an error rather than causing a silent corruption. > > > > > > > > The patchset has been tested by executing xfstests with the following > > > > mkfs.xfs options, > > > > 1. -m crc=0 -b size=1k > > > > 2. -m crc=0 -b size=4k > > > > 3. -m crc=0 -b size=512 > > > > 4. -m rmapbt=1,reflink=1 -b size=1k > > > > 5. -m rmapbt=1,reflink=1 -b size=4k > > > > > > > > The patches can also be obtained from > > > > https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14. > > > > > > > > I have two patches that define the newly introduced error injection > > > > tags in xfsprogs > > > > (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/). > > > > > > > > I have also written tests > > > > (https://github.com/chandanr/xfstests/commits/extent-overflow-tests) > > > > for verifying the checks introduced in the kernel. > > > > > > > > > > Hi Chandan and XFS folks, > > > > > > As you may have heard, I am working on producing a series of > > > xfs patches for stable v5.10.y. > > > > > > My patch selection is documented at [1]. > > > I am in the process of testing the backport patches against the 5.10.y > > > baseline using Luis' kdevops [2] fstests runner. > > > > > > The configurations that we are testing are: > > > 1. -m rmbat=0,reflink=1 -b size=4k (default) > > > 2. -m crc=0 -b size=4k > > > 3. -m crc=0 -b size=512 > > > 4. -m rmapbt=1,reflink=1 -b size=1k > > > 5. -m rmapbt=1,reflink=1 -b size=4k > > > > > > This patch set is the only largish series that I selected, because: > > > - It applies cleanly to 5.10.y > > > - I evaluated it as low risk and high value > > > > What value does it provide LTS users? > > > > Cloud providers deploy a large number of VMs/containers > and they may use reflink. So I think this could be an issue. 
> > > This series adds almost no value to normal users - extent count > > overflows are just something that doesn't happen in production > > systems at this point in time. The largest data extent count I've > > ever seen is still an order of magnitude of extents away from > > overflowing (i.e. 400 million extents seen, 4 billion to overflow), > > and nobody is using the attribute fork sufficiently hard to overflow > > 65536 extents (typically a couple of million xattrs per inode). > > > > i.e. this series is ground work for upcoming internal filesystem > > functionality that require much larger attribute forks (parent > > pointers and fsverity merkle tree storage) to be supported, and > > allow scope for much larger, massively fragmented VM image files > > (beyond 16TB on 4kB block size fs for worst case > > fragmentation/reflink). > > I am not sure I follow this argument. > Users can create large attributes, can they not? > And users can create massive fragmented/reflinked images, can they not? > If we have learned anything, is that if users can do something (i.e. on stable), > users will do that, so it may still be worth protecting this workflow? > > I argue that the reason that you did not see those constructs in the wild yet, > is the time it takes until users format new xfs filesystems with mkfs > that defaults > to reflink enabled and then use latest userspace tools that started to do > copy_file_range() or clone on their filesystem, perhaps even without the > user's knowledge, such as samba [1]. > > [1] https://gitlab.com/samba-team/samba/-/merge_requests/2044 > > > > > As a standalone patchset, this provides almost no real benefit to > > users but adds a whole new set of "hard stop" error paths across > > every operation that does inode data/attr extent allocation. i.e. > > the scope of affected functionality is very wide, the benefit > > to users is pretty much zero. > > > > Hence I'm left wondering what criteria ranks this as a high value > > change... > > > > Given your inputs, I am not sure that the fix has high value, but I must > say I didn't fully understand your argument. > It sounded like > "We don't need the fix because we did not see the problem yet", > but I may have misunderstood you. > > I am sure that you are aware of the fact that even though 5.10 is > almost 2 y/o, it has only been deployed recently by some distros. > > For example, Amazon AMI [2] and Google Cloud COS [3] images based > on the "new" 5.10 kernel were only released about half a year ago. > > [2] https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-linux-2-ami-kernel-5-10/ > [3] https://cloud.google.com/container-optimized-os/docs/release-notes/m93#cos-93-16623-39-6 > > I have not analysed the distro situation w.r.t xfsprogs, but here the > important factor is which version of xfsprogs was used to format the > user's filesystem, not which xfsprogs is installed on their system now. > > > > - Chandan has written good regression tests > > > > > > I intend to post the rest of the individual selected patches > > > for review in small batches after they pass the tests, but w.r.t this > > > patch set - > > > > > > Does anyone object to including it in the stable kernel > > > after it passes the tests? > > > > I prefer that the process doesn't result in taking random unnecesary > > functionality into stable kernels. The part of the LTS process that > > I've most disagreed with is the "backport random unnecessary > > changes" part of the stable selection criteria. 
It doesn't matter if > > it's selected by a bot or a human, the problems that causes are the > > same. > > I am in agreement with you. > > If you actually look at my selections [4] > I think that you will find that they are very far from "random". > I have tried to make it VERY easy to review my selections, by > listing the links to lore instead of the commit ids and my selection > process is also documented in the git log. > > TBH, *this* series was the one that I was mostly in doubt about, > which is one of the reasons I posted it first to the list. > I was pretty confident about my risk estimation, but not so much > about the value. > > Also, I am considering my post in this mailing list (without CC stable) > part of the process, and the inputs I got from you and from Chandan > is exactly what is missing in the regular stable tree process IMO, so > I appreciate your inputs very much. > > > > > Hence on those grounds, I'd say this isn't a stable backport > > candidate at all... > > > Allow me to rephrase that using a less hypothetical use case. Our team is working on an out-of-band dedupe tool, much like https://markfasheh.github.io/duperemove/duperemove.html but for larger scale filesystems and testing focus is on xfs. In certain settings, such as containers, the tool does not control the running kernel and *if* we require a new kernel, the newest we can require in this setting is 5.10.y. How would the tool know that it can safely create millions of dups that may get fragmented? One cannot expect from a user space tool to check which kernel it is running on, even asking which filesystem it is running on would be an irregular pattern. The tool just checks for clone/dedupe support in the underlying filesystem. The way I see it, backporting these changes to LTS kernel is the only way to move forward, unless you can tell me, and I did not understand that from your response, why our tool is safe to use on 5.10.y and why fragmentation cannot lead to hitting maximum extent limitation in kernel 5.10.y. So with that information in mind, I have to ask again: Does anyone *object* to including this series in the stable kernel after it passes the tests? Chandan and all, Do you consider it *harmful* to apply the 3 commits about directory extents that Chandan listed as "unneeded"? Please do not regard this as a philosophical question. Is there an actual known bug/regression from applying those 3 patches to the 5.10.y kernel? Because my fstests loop has been running on the recommended xfs configs for over 30 times now and have not detected any regression from the baseline LTS kernel so far. Thanks, Amir.
On Mon, May 23, 2022 at 10:06 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Mon, May 23, 2022 at 7:17 PM Chandan Babu R <chandan.babu@oracle.com> wrote: > > > > On Mon, May 23, 2022 at 02:15:44 PM +0300, Amir Goldstein wrote: > > > On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@gmail.com> wrote: > > >> > > >> XFS does not check for possible overflow of per-inode extent counter > > >> fields when adding extents to either data or attr fork. > > >> > > >> For e.g. > > >> 1. Insert 5 million xattrs (each having a value size of 255 bytes) and > > >> then delete 50% of them in an alternating manner. > > >> > > >> 2. On a 4k block sized XFS filesystem instance, the above causes 98511 > > >> extents to be created in the attr fork of the inode. > > >> > > >> xfsaild/loop0 2008 [003] 1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131 > > >> > > >> 3. The incore inode fork extent counter is a signed 32-bit > > >> quantity. However, the on-disk extent counter is an unsigned 16-bit > > >> quantity and hence cannot hold 98511 extents. > > >> > > >> 4. The following incorrect value is stored in the xattr extent counter, > > >> # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0 > > >> core.naextents = -32561 > > >> > > >> This patchset adds a new helper function > > >> (i.e. xfs_iext_count_may_overflow()) to check for overflow of the > > >> per-inode data and xattr extent counters and invokes it before > > >> starting an fs operation (e.g. creating a new directory entry). With > > >> this patchset applied, XFS detects counter overflows and returns with > > >> an error rather than causing a silent corruption. > > >> > > >> The patchset has been tested by executing xfstests with the following > > >> mkfs.xfs options, > > >> 1. -m crc=0 -b size=1k > > >> 2. -m crc=0 -b size=4k > > >> 3. -m crc=0 -b size=512 > > >> 4. -m rmapbt=1,reflink=1 -b size=1k > > >> 5. -m rmapbt=1,reflink=1 -b size=4k > > >> > > >> The patches can also be obtained from > > >> https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14. > > >> > > >> I have two patches that define the newly introduced error injection > > >> tags in xfsprogs > > >> (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/). > > >> > > >> I have also written tests > > >> (https://github.com/chandanr/xfstests/commits/extent-overflow-tests) > > >> for verifying the checks introduced in the kernel. > > >> > > > > > > Hi Chandan and XFS folks, > > > > > > As you may have heard, I am working on producing a series of > > > xfs patches for stable v5.10.y. > > > > > > My patch selection is documented at [1]. > > > I am in the process of testing the backport patches against the 5.10.y > > > baseline using Luis' kdevops [2] fstests runner. > > > > > > The configurations that we are testing are: > > > 1. -m rmbat=0,reflink=1 -b size=4k (default) > > > 2. -m crc=0 -b size=4k > > > 3. -m crc=0 -b size=512 > > > 4. -m rmapbt=1,reflink=1 -b size=1k > > > 5. 
-m rmapbt=1,reflink=1 -b size=4k > > > > > > This patch set is the only largish series that I selected, because: > > > - It applies cleanly to 5.10.y > > > - I evaluated it as low risk and high value > > > - Chandan has written good regression tests > > > > > > I intend to post the rest of the individual selected patches > > > for review in small batches after they pass the tests, but w.r.t this > > > patch set - > > > > > > Does anyone object to including it in the stable kernel > > > after it passes the tests? > > > > > > > Hi Amir, > > > > The following three commits will have to be skipped from the series, > > > > 1. 02092a2f034fdeabab524ae39c2de86ba9ffa15a > > xfs: Check for extent overflow when renaming dir entries > > > > 2. 0dbc5cb1a91cc8c44b1c75429f5b9351837114fd > > xfs: Check for extent overflow when removing dir entries > > > > 3. f5d92749191402c50e32ac83dd9da3b910f5680f > > xfs: Check for extent overflow when adding dir entries > > > > The maximum size of a directory data fork is ~96GiB. This is much smaller than > > what can be accommodated by the existing data fork extent counter (i.e. 2^31 > > extents). > > > > Thanks for this information! > > I understand that the "fixes" are not needed, but the moto of the stable > tree maintainers is that taking harmless patches is preferred over non > clean backports and without those patches, the rest of the series does > not apply cleanly. > > So the question is: does it hurt to take those patches to the stable tree? All right, I've found the revert partial patch in for-next: 83a21c18441f xfs: Directory's data fork extent counter can never overflow I can backport this patch to stable after it hits mainline (since this is not an urgent fix I would wait for v.5.19.0) with the obvious omission of the XFS_MAX_EXTCNT_*_FORK_LARGE constants. But even then, unless we have a clear revert in mainline, it is better to have the history in stable as it was in mainline. Furthermore, stable, even more than mainline, should always prefer safety over performance optimization, the sending the 3 patches already in mainline to stable without the partial revert is better than sending no patches at all and better then delaying the process. Thanks, Amir.
On Tue, May 24, 2022 at 08:36:50AM +0300, Amir Goldstein wrote: > On Tue, May 24, 2022 at 1:43 AM Dave Chinner <david@fromorbit.com> wrote: > > > > On Mon, May 23, 2022 at 02:15:44PM +0300, Amir Goldstein wrote: > > > On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@gmail.com> wrote: > > > > > > > > XFS does not check for possible overflow of per-inode extent counter > > > > fields when adding extents to either data or attr fork. > > > > > > > > For e.g. > > > > 1. Insert 5 million xattrs (each having a value size of 255 bytes) and > > > > then delete 50% of them in an alternating manner. > > > > > > > > 2. On a 4k block sized XFS filesystem instance, the above causes 98511 > > > > extents to be created in the attr fork of the inode. > > > > > > > > xfsaild/loop0 2008 [003] 1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131 > > > > > > > > 3. The incore inode fork extent counter is a signed 32-bit > > > > quantity. However, the on-disk extent counter is an unsigned 16-bit > > > > quantity and hence cannot hold 98511 extents. > > > > > > > > 4. The following incorrect value is stored in the xattr extent counter, > > > > # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0 > > > > core.naextents = -32561 > > > > > > > > This patchset adds a new helper function > > > > (i.e. xfs_iext_count_may_overflow()) to check for overflow of the > > > > per-inode data and xattr extent counters and invokes it before > > > > starting an fs operation (e.g. creating a new directory entry). With > > > > this patchset applied, XFS detects counter overflows and returns with > > > > an error rather than causing a silent corruption. > > > > > > > > The patchset has been tested by executing xfstests with the following > > > > mkfs.xfs options, > > > > 1. -m crc=0 -b size=1k > > > > 2. -m crc=0 -b size=4k > > > > 3. -m crc=0 -b size=512 > > > > 4. -m rmapbt=1,reflink=1 -b size=1k > > > > 5. -m rmapbt=1,reflink=1 -b size=4k > > > > > > > > The patches can also be obtained from > > > > https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14. > > > > > > > > I have two patches that define the newly introduced error injection > > > > tags in xfsprogs > > > > (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/). > > > > > > > > I have also written tests > > > > (https://github.com/chandanr/xfstests/commits/extent-overflow-tests) > > > > for verifying the checks introduced in the kernel. > > > > > > > > > > Hi Chandan and XFS folks, > > > > > > As you may have heard, I am working on producing a series of > > > xfs patches for stable v5.10.y. > > > > > > My patch selection is documented at [1]. > > > I am in the process of testing the backport patches against the 5.10.y > > > baseline using Luis' kdevops [2] fstests runner. > > > > > > The configurations that we are testing are: > > > 1. -m rmbat=0,reflink=1 -b size=4k (default) > > > 2. -m crc=0 -b size=4k > > > 3. -m crc=0 -b size=512 > > > 4. -m rmapbt=1,reflink=1 -b size=1k > > > 5. -m rmapbt=1,reflink=1 -b size=4k > > > > > > This patch set is the only largish series that I selected, because: > > > - It applies cleanly to 5.10.y > > > - I evaluated it as low risk and high value > > > > What value does it provide LTS users? > > > > Cloud providers deploy a large number of VMs/containers > and they may use reflink. So I think this could be an issue. 
Cloud providers are not deploying multi-TB VM images on XFS without also using some mechanism for avoiding worst-case fragmentation. They know all about the problems that manifest when extent counts get into the tens of millions, let alone billions.... e.g. first access to a file pulls the entire extent list into memory, so for a file with 4 billion extents this will take hours to pull into memory (single threaded, synchronous read IO of millions of filesystem blocks) and consume and consume >100GB of RAM for the in-memory extent list. Having VM startup get delayed by hours and put a massive load on the cloud storage infrastructure for that entire length of time isn't desirable behaviour... For multi-TB VM image deployment - especially with reflink on the image file - extent size hints are needed to mitigate worst case fragmentation. Reflink copies can run at up to about 100,000 extents/s, so if you reflink a file with 4 billion extents in it, not only do you need another 100GB RAM, you also need to wait several hours for the reflink to run. And while that reflink is running, nothing else has access the data in that VM image: your VM is *down* for *hours* while you snapshot it. Typical mitigation is extent size hints in the MB ranges to reduce worst case fragmentation by two orders of magnitude (i.e. limit to tens of millions of extents, not billions) which brings snapshot times down to a minute or two. IOWs, it's obviously not practical to scale VM images out to billions of extents, even though we support extent counts in the billions. > > This series adds almost no value to normal users - extent count > > overflows are just something that doesn't happen in production > > systems at this point in time. The largest data extent count I've > > ever seen is still an order of magnitude of extents away from > > overflowing (i.e. 400 million extents seen, 4 billion to overflow), > > and nobody is using the attribute fork sufficiently hard to overflow > > 65536 extents (typically a couple of million xattrs per inode). > > > > i.e. this series is ground work for upcoming internal filesystem > > functionality that require much larger attribute forks (parent > > pointers and fsverity merkle tree storage) to be supported, and > > allow scope for much larger, massively fragmented VM image files > > (beyond 16TB on 4kB block size fs for worst case > > fragmentation/reflink). > > I am not sure I follow this argument. > Users can create large attributes, can they not? Sure. But *nobody does*, and there are good reasons we don't see people doing this. The reality is that apps don't use xattrs heavily because filesystems are traditionally very bad at storing even moderate numbers of xattrs. XFS is the exception to the rule. Hence nobody is trying to use a few million xattrs per inode right now, and it's unlikely anyone will unless they specifically target XFS. In which case, they are going to want the large extent count stuff that just got merged into the for-next tree, and this whole discussion is moot.... > And users can create massive fragmented/reflinked images, can they not? Yes, and they will hit scalability problems long before they get anywhere near 4 billion extents. > If we have learned anything, is that if users can do something (i.e. on stable), > users will do that, so it may still be worth protecting this workflow? If I have learned anything, it's that huge extent counts are highly impractical for most workloads for one reason or another. 
We are a long way for enabling practical use of extent counts in the billions. Demand paging the extent list is the bare minimum we need, but then there's sheer scale of modifications reflink and unlink need to make (billions of transactions to share/free billions of individual extents) and there's no magic solution to that. > I argue that the reason that you did not see those constructs in the wild yet, > is the time it takes until users format new xfs filesystems with mkfs It really has nothing to do with filesystem formats and everything to do with the *cost* of creating, accessing, indexing and managing billions of extents. Have you ever tried to create a file with 4 billion extents in it? Even using fallocate to do it as fast as possible (no data IO!), I ran out of RAM on my 128GB test machine after 6 days of doing nothing but running fallocate() on a single inode. The kernel died a horrible OOM killer death at around 2.5 billion extents because the extent list cannot be reclaimed from memory while the inode is in use and the kernel ran out of all other memory it could reclaim as the extent list grew. The only way to fix that is to make the extent lists reclaimable (i.e. demand paging of the in-memory extent list) and that's a big chunk of work that isn't on anyone's radar right now. > Given your inputs, I am not sure that the fix has high value, but I must > say I didn't fully understand your argument. > It sounded like > "We don't need the fix because we did not see the problem yet", > but I may have misunderstood you. I hope you now realise that there are much bigger practical scalability limitations with extent lists and reflink that will manifest in production systems long before we get anywhere near billions of extents per inode. Cheers, Dave.
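[To make the extent size hint mitigation above concrete, here is a sketch using xfs_io. The 1MiB value and the paths are illustrative only (the thread just says "MB ranges"), and the hint needs to be set before the file accumulates extents.]

  # Set a 1MiB extent size hint on the directory holding VM images; files
  # created under it afterwards inherit the hint.
  xfs_io -c 'extsize 1m' /mnt/images

  # Or set it on a freshly created, still-empty image file:
  touch /mnt/images/vm.img
  xfs_io -c 'extsize 1m' /mnt/images/vm.img

  # Query the hint and check how many data extents the file currently has:
  xfs_io -c 'extsize' /mnt/images/vm.img
  xfs_io -c 'stat' /mnt/images/vm.img | grep nextents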
On Wed, May 25, 2022 at 10:33 AM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, May 24, 2022 at 08:36:50AM +0300, Amir Goldstein wrote: > > On Tue, May 24, 2022 at 1:43 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Mon, May 23, 2022 at 02:15:44PM +0300, Amir Goldstein wrote: > > > > On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@gmail.com> wrote: > > > > > > > > > > XFS does not check for possible overflow of per-inode extent counter > > > > > fields when adding extents to either data or attr fork. > > > > > > > > > > For e.g. > > > > > 1. Insert 5 million xattrs (each having a value size of 255 bytes) and > > > > > then delete 50% of them in an alternating manner. > > > > > > > > > > 2. On a 4k block sized XFS filesystem instance, the above causes 98511 > > > > > extents to be created in the attr fork of the inode. > > > > > > > > > > xfsaild/loop0 2008 [003] 1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131 > > > > > > > > > > 3. The incore inode fork extent counter is a signed 32-bit > > > > > quantity. However, the on-disk extent counter is an unsigned 16-bit > > > > > quantity and hence cannot hold 98511 extents. > > > > > > > > > > 4. The following incorrect value is stored in the xattr extent counter, > > > > > # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0 > > > > > core.naextents = -32561 > > > > > > > > > > This patchset adds a new helper function > > > > > (i.e. xfs_iext_count_may_overflow()) to check for overflow of the > > > > > per-inode data and xattr extent counters and invokes it before > > > > > starting an fs operation (e.g. creating a new directory entry). With > > > > > this patchset applied, XFS detects counter overflows and returns with > > > > > an error rather than causing a silent corruption. > > > > > > > > > > The patchset has been tested by executing xfstests with the following > > > > > mkfs.xfs options, > > > > > 1. -m crc=0 -b size=1k > > > > > 2. -m crc=0 -b size=4k > > > > > 3. -m crc=0 -b size=512 > > > > > 4. -m rmapbt=1,reflink=1 -b size=1k > > > > > 5. -m rmapbt=1,reflink=1 -b size=4k > > > > > > > > > > The patches can also be obtained from > > > > > https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14. > > > > > > > > > > I have two patches that define the newly introduced error injection > > > > > tags in xfsprogs > > > > > (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/). > > > > > > > > > > I have also written tests > > > > > (https://github.com/chandanr/xfstests/commits/extent-overflow-tests) > > > > > for verifying the checks introduced in the kernel. > > > > > > > > > > > > > Hi Chandan and XFS folks, > > > > > > > > As you may have heard, I am working on producing a series of > > > > xfs patches for stable v5.10.y. > > > > > > > > My patch selection is documented at [1]. > > > > I am in the process of testing the backport patches against the 5.10.y > > > > baseline using Luis' kdevops [2] fstests runner. > > > > > > > > The configurations that we are testing are: > > > > 1. -m rmbat=0,reflink=1 -b size=4k (default) > > > > 2. -m crc=0 -b size=4k > > > > 3. -m crc=0 -b size=512 > > > > 4. -m rmapbt=1,reflink=1 -b size=1k > > > > 5. -m rmapbt=1,reflink=1 -b size=4k > > > > > > > > This patch set is the only largish series that I selected, because: > > > > - It applies cleanly to 5.10.y > > > > - I evaluated it as low risk and high value > > > > > > What value does it provide LTS users? 
> > > > > > > Cloud providers deploy a large number of VMs/containers > > and they may use reflink. So I think this could be an issue. > > Cloud providers are not deploying multi-TB VM images on XFS without > also using some mechanism for avoiding worst-case fragmentation. > They know all about the problems that manifest when extent > counts get into the tens of millions, let alone billions.... > > e.g. first access to a file pulls the entire extent list into > memory, so for a file with 4 billion extents this will take hours to > pull into memory (single threaded, synchronous read IO of millions > of filesystem blocks) and consume and consume >100GB of RAM for the > in-memory extent list. Having VM startup get delayed by hours and > put a massive load on the cloud storage infrastructure for that > entire length of time isn't desirable behaviour... > > For multi-TB VM image deployment - especially with reflink on the > image file - extent size hints are needed to mitigate worst case > fragmentation. Reflink copies can run at up to about 100,000 > extents/s, so if you reflink a file with 4 billion extents in it, > not only do you need another 100GB RAM, you also need to wait > several hours for the reflink to run. And while that reflink is > running, nothing else has access the data in that VM image: your VM > is *down* for *hours* while you snapshot it. > > Typical mitigation is extent size hints in the MB ranges to reduce > worst case fragmentation by two orders of magnitude (i.e. limit to > tens of millions of extents, not billions) which brings snapshot > times down to a minute or two. > > IOWs, it's obviously not practical to scale VM images out to > billions of extents, even though we support extent counts in the > billions. > > > > This series adds almost no value to normal users - extent count > > > overflows are just something that doesn't happen in production > > > systems at this point in time. The largest data extent count I've > > > ever seen is still an order of magnitude of extents away from > > > overflowing (i.e. 400 million extents seen, 4 billion to overflow), > > > and nobody is using the attribute fork sufficiently hard to overflow > > > 65536 extents (typically a couple of million xattrs per inode). > > > > > > i.e. this series is ground work for upcoming internal filesystem > > > functionality that require much larger attribute forks (parent > > > pointers and fsverity merkle tree storage) to be supported, and > > > allow scope for much larger, massively fragmented VM image files > > > (beyond 16TB on 4kB block size fs for worst case > > > fragmentation/reflink). > > > > I am not sure I follow this argument. > > Users can create large attributes, can they not? > > Sure. But *nobody does*, and there are good reasons we don't see > people doing this. > > The reality is that apps don't use xattrs heavily because > filesystems are traditionally very bad at storing even moderate > numbers of xattrs. XFS is the exception to the rule. Hence nobody is > trying to use a few million xattrs per inode right now, and it's > unlikely anyone will unless they specifically target XFS. In which > case, they are going to want the large extent count stuff that just > got merged into the for-next tree, and this whole discussion is > moot.... With all the barriers to large extents count that you mentioned I wonder how large extent counters feature mitigates those, but that is irrelevant to the question at hand. > > > And users can create massive fragmented/reflinked images, can they not? 
> > Yes, and they will hit scalability problems long before they get > anywhere near 4 billion extents. > > > If we have learned anything, is that if users can do something (i.e. on stable), > > users will do that, so it may still be worth protecting this workflow? > > If I have learned anything, it's that huge extent counts are highly > impractical for most workloads for one reason or another. We are a > long way for enabling practical use of extent counts in the > billions. Demand paging the extent list is the bare minimum we need, > but then there's sheer scale of modifications reflink and unlink > need to make (billions of transactions to share/free billions of > individual extents) and there's no magic solution to that. > > > I argue that the reason that you did not see those constructs in the wild yet, > > is the time it takes until users format new xfs filesystems with mkfs > > It really has nothing to do with filesystem formats and everything > to do with the *cost* of creating, accessing, indexing and managing > billions of extents. > > Have you ever tried to create a file with 4 billion extents in it? > Even using fallocate to do it as fast as possible (no data IO!), I > ran out of RAM on my 128GB test machine after 6 days of doing > nothing but running fallocate() on a single inode. The kernel died a > horrible OOM killer death at around 2.5 billion extents because the > extent list cannot be reclaimed from memory while the inode is in > use and the kernel ran out of all other memory it could reclaim as > the extent list grew. > > The only way to fix that is to make the extent lists reclaimable > (i.e. demand paging of the in-memory extent list) and that's a big > chunk of work that isn't on anyone's radar right now. > > > Given your inputs, I am not sure that the fix has high value, but I must > > say I didn't fully understand your argument. > > It sounded like > > "We don't need the fix because we did not see the problem yet", > > but I may have misunderstood you. > > I hope you now realise that there are much bigger practical > scalability limitations with extent lists and reflink that will > manifest in production systems long before we get anywhere near > billions of extents per inode. > I do! And I *really* appreciate the time that you took to explain it to me (and to everyone). I'm dropping this series from my xfs-5.10.y queue. Thanks, Amir.
On Tue, May 24, 2022 at 07:05:07PM +0300, Amir Goldstein wrote: > On Tue, May 24, 2022 at 8:36 AM Amir Goldstein <amir73il@gmail.com> wrote: > > Allow me to rephrase that using a less hypothetical use case. > > Our team is working on an out-of-band dedupe tool, much like > https://markfasheh.github.io/duperemove/duperemove.html > but for larger scale filesystems and testing focus is on xfs. dedupe is nothing new. It's being done in production systems and has been for a while now. e.g. Veeam has a production server back end for their reflink/dedupe based backup software that is hosted on XFS. The only scalability issues we've seen with those systems managing tens of TB of heavily cross-linked files so far have been limited to how long unlink of those large files takes. Dedupe/reflink speeds up ingest for backup farms, but it slows down removal/garbage collection of backup that are no longer needed. The big reflink/dedupe backup farms I've seen problems with are generally dealing with extent counts per file in the tens of millions, which is still very managable. Maybe we'll see more problems as data sets grow, but it's also likely that the crosslinked data sets the applications build will scale out (more base files) instead of up (larger base files). This will mean they remain at the "tens of millions of extents per file" level and won't stress the filesystem any more than they already do. > In certain settings, such as containers, the tool does not control the > running kernel and *if* we require a new kernel, the newest we can > require in this setting is 5.10.y. *If* you have a customer that creates a billion extents in a single file, then you could consider backporting this. But until managing billions of extents per file is an actual issue for production filesystems, it's unnecessary to backport these changes. > How would the tool know that it can safely create millions of dups > that may get fragmented? Millions or shared extents in a single file aren't a problem at all. Millions of references to a single shared block aren't a problem at all, either. But there are limits to how much you can share a single block, and those limits are *highly variable* because they are dependent on free space being available to record references. e.g. XFS can share a single block a maximum of 2^32 -1 times. If a user turns on rmapbt, that max share limit drops way down to however many individual rmap records can be stored in the rmap btree before the AG runs out of space. If the AGs are small and/or full of other data, that could limit sharing of a single block to a few hundreds of references. IOWs, applications creating shared extents must expect the operation to fail at any time, without warning. And dedupe applications need to be able to index multiple replicas of the same block so that they aren't limited to deduping that data to a single block that has arbitrary limits on how many times it can be shared. > Does anyone *object* to including this series in the stable kernel > after it passes the tests? If you end up having a customer that hits a billion extents in a single file, then you can backport these patches to the 5.10.y series. But without any obvious production need for these patches, they don't fit the criteria for stable backports... Don't change what ain't broke. Cheers, Dave.
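[For a dedupe tool like the one described earlier in the thread, the practical consequence of the above is to treat every dedupe request as fallible and carry on when the kernel refuses. A sketch using xfs_io's dedupe command; the paths, offsets and length are made up, and a real tool such as duperemove drives the underlying FIDEDUPERANGE ioctl directly and inspects the per-range status.]

  # Try to dedupe the first 1MiB of two files already known to hold identical
  # data. If the kernel refuses (per-block sharing limit, rmapbt space,
  # missing dedupe support, ...), keep both physical copies and move on.
  src=/mnt/backup/a.img
  dst=/mnt/backup/b.img
  xfs_io -c "dedupe $src 0 0 1m" "$dst" ||
          echo "dedupe of $dst skipped, leaving duplicate blocks in place" >&2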
On Wed, May 25, 2022 at 10:48:09AM +0300, Amir Goldstein wrote: > On Wed, May 25, 2022 at 10:33 AM Dave Chinner <david@fromorbit.com> wrote: > > > > On Tue, May 24, 2022 at 08:36:50AM +0300, Amir Goldstein wrote: > > > On Tue, May 24, 2022 at 1:43 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > On Mon, May 23, 2022 at 02:15:44PM +0300, Amir Goldstein wrote: > > > > > On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@gmail.com> wrote: > > > > > > I am not sure I follow this argument. > > > Users can create large attributes, can they not? > > > > Sure. But *nobody does*, and there are good reasons we don't see > > people doing this. > > > > The reality is that apps don't use xattrs heavily because > > filesystems are traditionally very bad at storing even moderate > > numbers of xattrs. XFS is the exception to the rule. Hence nobody is > > trying to use a few million xattrs per inode right now, and it's > > unlikely anyone will unless they specifically target XFS. In which > > case, they are going to want the large extent count stuff that just > > got merged into the for-next tree, and this whole discussion is > > moot.... > > With all the barriers to large extents count that you mentioned > I wonder how large extent counters feature mitigates those, > but that is irrelevant to the question at hand. They don't. That's the point I'm trying to make - these patches don't actually fix any problems with large data fork extent counts - they just allow them to get bigger. As I said earlier - the primary driver for these changes is not growing the number of data extents or reflink - it's growing the amount of data we can store in the attribute fork. We need to grow that from 2^16 extents to 2^32 extents because we want to be able to store hundreds of millions of xattrs per file for internal filesystem purposes. Extending the data fork to 2^48 extents at the same time just makes sense from an on-disk format perspective, not because the current code can scale effectively to 2^32 extents, but because we're already changing all that code to support a different attr fork extent size. We will probably need >2^32 extents in the next decade, so we're making the change now while we are touching the code.... There are future mods planned that will make large extent counts bearable, but we don't have any idea how to solve problems like making reflink go from O(n) to O(log n) to make reflink of billion extent files an every day occurrence.... Cheers, Dave.