Message ID: 20230518223326.18744-1-sarthakkukreti@chromium.org
Series: Introduce provisioning primitives
FYI, I really don't think this primitive is a good idea. In the context of non-overwritable storage (NAND, SMR drives) the entire concept of a one-shot 'provisioning' that will guarantee later writes are always possible is simply bogus.
On Fri, May 19 2023 at 12:09P -0400, Christoph Hellwig <hch@infradead.org> wrote:

> FYI, I really don't think this primitive is a good idea. In the
> concept of non-overwritable storage (NAND, SMR drives) the entire
> concept of a one-shoot 'provisioning' that will guarantee later writes
> are always possible is simply bogus.

Valid point for sure, such storage shouldn't advertise support (and will return -EOPNOTSUPP).

But the primitive still has utility for other classes of storage.
On Fri, May 19, 2023 at 10:41:31AM -0400, Mike Snitzer wrote:
> On Fri, May 19 2023 at 12:09P -0400,
> Christoph Hellwig <hch@infradead.org> wrote:
>
> > FYI, I really don't think this primitive is a good idea. In the
> > concept of non-overwritable storage (NAND, SMR drives) the entire
> > concept of a one-shoot 'provisioning' that will guarantee later writes
> > are always possible is simply bogus.
>
> Valid point for sure, such storage shouldn't advertise support (and
> will return -EOPNOTSUPP).
>
> But the primitive still has utility for other classes of storage.

Yet the thing people are wanting us filesystem developers to use this with is thinly provisioned storage that has snapshot capability. That, by definition, is non-overwritable storage. These are the use cases people are asking filesystems to gracefully handle, reporting errors when the sparse backing store runs out of space.

e.g. journal writes after a snapshot is taken on a busy filesystem are always an overwrite, and this requires more space in the storage device for the write to succeed. ENOSPC from the backing device for journal IO is a -fatal error-. Hence if REQ_PROVISION doesn't guarantee space for overwrites after snapshots, then it's not actually useful for solving the real world use cases we actually need device-level provisioning to solve.

It is not viable for filesystems to have to reprovision space for in-place metadata overwrites after every snapshot - the filesystem may not even know a snapshot has been taken! And it's not feasible for filesystems to provision on demand before they modify metadata, because we don't know what metadata is going to need to be modified before we start modifying metadata in transactions. If we get ENOSPC from provisioning in the middle of a dirty transaction, it's all over just the same as if we get ENOSPC during metadata writeback...

Hence what filesystems actually need is for device provisioned space to be -always over-writable- without ENOSPC occurring.
Ideally, if we provision a range of the block device, the block device *must* guarantee that all future writes to that LBA range succeed. That guarantee needs to stand until we discard or unmap the LBA range, and for however many writes we do to that LBA range.

e.g. If the device takes a snapshot, it needs to reprovision the potential COW ranges that overlap with the provisioned LBA range at snapshot time, e.g. by re-reserving the space from the backing pool for the provisioned space so that if a COW occurs there is space guaranteed for it to succeed. If there isn't space in the backing pool for the reprovisioning, then whatever operation triggers the COW behaviour should fail with ENOSPC before doing anything else....

Software devices like dm-thin/snapshot should really only need to keep a persistent map of the provisioned space and refresh space reservations for used space within that map whenever something that triggers COW behaviour occurs. i.e. a snapshot needs to reset the provisioned ranges back to "all ranges are freshly provisioned" before the snapshot is started. If that space is not available in the backing pool, then the snapshot attempt gets ENOSPC....

That means filesystems only need to provision space for journals and fixed metadata at mkfs time, and they need only issue a REQ_PROVISION bio when they first allocate overwrite-in-place metadata. We already have online discard and/or fstrim for releasing provisioned space via discards.

This will require some mods to filesystems like ext4 and XFS to issue REQ_PROVISION and fail gracefully during metadata allocation. However, doing so means that we can actually harden filesystems against sparse block device ENOSPC errors by ensuring they will never occur in critical filesystem structures....

-Dave.
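[The semantics above can be sketched as a toy accounting model. This is a hypothetical Python simulation for illustration only; `ThinPool` and its methods are invented names, not dm-thin code. The key point it demonstrates: a snapshot must re-reserve backing space for every provisioned block up front, failing with ENOSPC before the snapshot starts rather than failing a later overwrite.]

```python
import errno

class ThinPool:
    """Toy model of a thin pool with REQ_PROVISION-style guarantees (hypothetical)."""
    def __init__(self, backing_blocks):
        self.free = backing_blocks        # unallocated backing blocks
        self.provisioned = set()          # LBAs with a write guarantee
        self.cow_reserve = 0              # blocks reserved for future COWs

    def provision(self, lba):
        # REQ_PROVISION: reserve backing space now so later writes succeed.
        if lba in self.provisioned:
            return 0
        if self.free == 0:
            return -errno.ENOSPC
        self.free -= 1
        self.provisioned.add(lba)
        return 0

    def snapshot(self):
        # Every provisioned block may need a COW after the snapshot, so
        # re-reserve space for all of them before the snapshot starts.
        need = len(self.provisioned)
        if self.free < need:
            return -errno.ENOSPC          # fail up front, not mid-write
        self.free -= need
        self.cow_reserve += need
        return 0

    def write(self, lba):
        # Writes to provisioned LBAs always succeed; a post-snapshot COW
        # consumes from the reservation made at snapshot time.
        if lba in self.provisioned:
            if self.cow_reserve:
                self.cow_reserve -= 1     # break sharing using reserved space
            return 0
        if self.free == 0:
            return -errno.ENOSPC
        self.free -= 1
        return 0

pool = ThinPool(backing_blocks=3)
assert pool.provision(0) == 0
assert pool.provision(1) == 0
assert pool.snapshot() == -errno.ENOSPC   # 1 free block can't cover 2 COWs
```

Note how the failure surfaces at snapshot time, which is the operation that creates the COW obligation, exactly as argued above.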
On Fri, May 19 2023 at 7:07P -0400, Dave Chinner <david@fromorbit.com> wrote:

> On Fri, May 19, 2023 at 10:41:31AM -0400, Mike Snitzer wrote:
> > On Fri, May 19 2023 at 12:09P -0400,
> > Christoph Hellwig <hch@infradead.org> wrote:
> >
> > > FYI, I really don't think this primitive is a good idea. In the
> > > concept of non-overwritable storage (NAND, SMR drives) the entire
> > > concept of a one-shoot 'provisioning' that will guarantee later writes
> > > are always possible is simply bogus.
> >
> > Valid point for sure, such storage shouldn't advertise support (and
> > will return -EOPNOTSUPP).
> >
> > But the primitive still has utility for other classes of storage.
>
> Yet the thing people are wanting to us filesystem developers to use
> this with is thinly provisioned storage that has snapshot
> capability. That, by definition, is non-overwritable storage. These
> are the use cases people are asking filesystes to gracefully handle
> and report errors when the sparse backing store runs out of space.

DM thinp falls into this category, but as you detailed it can be made to work reliably. To carry that forward we need to first establish the REQ_PROVISION primitive (with this series).

Follow-on associated dm-thinp enhancements can then serve as a reference for how to take advantage of XFS's ability to operate reliably on thinly provisioned storage.

> e.g. journal writes after a snapshot is taken on a busy filesystem
> are always an overwrite and this requires more space in the storage
> device for the write to succeed. ENOSPC from the backing device for
> journal IO is a -fatal error-. Hence if REQ_PROVISION doesn't
> guarantee space for overwrites after snapshots, then it's not
> actually useful for solving the real world use cases we actually
> need device-level provisioning to solve.
>
> It is not viable for filesystems to have to reprovision space for
> in-place metadata overwrites after every snapshot - the filesystem
> may not even know a snapshot has been taken! And it's not feasible
> for filesystems to provision on demand before they modify metadata
> because we don't know what metadata is going to need to be modified
> before we start modifying metadata in transactions. If we get ENOSPC
> from provisioning in the middle of a dirty transcation, it's all
> over just the same as if we get ENOSPC during metadata writeback...
>
> Hence what filesystems actually need is device provisioned space to
> be -always over-writable- without ENOSPC occurring. Ideally, if we
> provision a range of the block device, the block device *must*
> guarantee all future writes to that LBA range succeeds. That
> guarantee needs to stand until we discard or unmap the LBA range,
> and for however many writes we do to that LBA range.
>
> e.g. If the device takes a snapshot, it needs to reprovision the
> potential COW ranges that overlap with the provisioned LBA range at
> snapshot time. e.g. by re-reserving the space from the backing pool
> for the provisioned space so if a COW occurs there is space
> guaranteed for it to succeed. If there isn't space in the backing
> pool for the reprovisioning, then whatever operation that triggers
> the COW behaviour should fail with ENOSPC before doing anything
> else....

Happy to implement this in dm-thinp. Each thin block will need a bit to say if the block must be REQ_PROVISION'd at time of snapshot (and the resulting block will need the same bit set).

Walking all blocks of a thin device and triggering REQ_PROVISION for each will obviously make thin snapshot creation take more time.

I think this approach is better than having a dedicated bitmap hooked off each thin device's metadata (with the bitmap being copied and walked at the time of snapshot). But we'll see... I'll get with Joe to discuss further.
> Software devices like dm-thin/snapshot should really only need to
> keep a persistent map of the provisioned space and refresh space
> reservations for used space within that map whenever something that
> triggers COW behaviour occurs. i.e. a snapshot needs to reset the
> provisioned ranges back to "all ranges are freshly provisioned"
> before the snapshot is started. If that space is not available in
> the backing pool, then the snapshot attempt gets ENOSPC....
>
> That means filesystems only need to provision space for journals and
> fixed metadata at mkfs time, and they only need issue a
> REQ_PROVISION bio when they first allocate over-write in place
> metadata. We already have online discard and/or fstrim for releasing
> provisioned space via discards.
>
> This will require some mods to filesystems like ext4 and XFS to
> issue REQ_PROVISION and fail gracefully during metadata allocation.
> However, doing so means that we can actually harden filesystems
> against sparse block device ENOSPC errors by ensuring they will
> never occur in critical filesystem structures....

Yes, let's finally _do_ this! ;)

Mike
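[Mike's per-block provision bit can be sketched the same way. Again a hypothetical Python toy, with invented names (`ThinDevice`, `req_provision`): each mapped block carries a "must reprovision on snapshot" bit, and snapshot creation walks all blocks, reserving COW space for every flagged one before the snapshot is allowed to proceed.]

```python
import errno

class ThinDevice:
    """Toy model of a per-block 'provision at snapshot' bit (hypothetical)."""
    def __init__(self, pool_free):
        self.pool_free = pool_free        # free blocks in the shared pool
        self.blocks = {}                  # lba -> must_provision bit

    def req_provision(self, lba):
        # Allocate the block now and mark it sticky across snapshots.
        if lba not in self.blocks:
            if self.pool_free == 0:
                return -errno.ENOSPC
            self.pool_free -= 1
        self.blocks[lba] = True
        return 0

    def snapshot(self):
        # Walk all blocks of the device; every one with the bit set must
        # be reprovisioned so post-snapshot overwrites always succeed.
        # (The resulting shared/broken-out blocks inherit the same bit.)
        need = sum(1 for bit in self.blocks.values() if bit)
        if self.pool_free < need:
            return -errno.ENOSPC          # snapshot fails before it starts
        self.pool_free -= need            # reserve COW space up front
        return 0

dev = ThinDevice(pool_free=3)
assert dev.req_provision(0) == 0
assert dev.req_provision(1) == 0
assert dev.snapshot() == -errno.ENOSPC    # 1 free block can't cover 2 COWs
```

The walk is what makes snapshot creation slower, as noted above: its cost is proportional to the number of mapped blocks, not the number of provisioned ones.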
On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote: > On Fri, May 19 2023 at 7:07P -0400, > Dave Chinner <david@fromorbit.com> wrote: > > > On Fri, May 19, 2023 at 10:41:31AM -0400, Mike Snitzer wrote: > > > On Fri, May 19 2023 at 12:09P -0400, > > > Christoph Hellwig <hch@infradead.org> wrote: > > > > > > > FYI, I really don't think this primitive is a good idea. In the > > > > concept of non-overwritable storage (NAND, SMR drives) the entire > > > > concept of a one-shoot 'provisioning' that will guarantee later writes > > > > are always possible is simply bogus. > > > > > > Valid point for sure, such storage shouldn't advertise support (and > > > will return -EOPNOTSUPP). > > > > > > But the primitive still has utility for other classes of storage. > > > > Yet the thing people are wanting to us filesystem developers to use > > this with is thinly provisioned storage that has snapshot > > capability. That, by definition, is non-overwritable storage. These > > are the use cases people are asking filesystes to gracefully handle > > and report errors when the sparse backing store runs out of space. > > DM thinp falls into this category but as you detailed it can be made > to work reliably. To carry that forward we need to first establish > the REQ_PROVISION primitive (with this series). > > Follow-on associated dm-thinp enhancements can then serve as reference > for how to take advantage of XFS's ability to operate reliably of > thinly provisioned storage. > > > e.g. journal writes after a snapshot is taken on a busy filesystem > > are always an overwrite and this requires more space in the storage > > device for the write to succeed. ENOSPC from the backing device for > > journal IO is a -fatal error-. Hence if REQ_PROVISION doesn't > > guarantee space for overwrites after snapshots, then it's not > > actually useful for solving the real world use cases we actually > > need device-level provisioning to solve. 
> > > > It is not viable for filesystems to have to reprovision space for > > in-place metadata overwrites after every snapshot - the filesystem > > may not even know a snapshot has been taken! And it's not feasible > > for filesystems to provision on demand before they modify metadata > > because we don't know what metadata is going to need to be modified > > before we start modifying metadata in transactions. If we get ENOSPC > > from provisioning in the middle of a dirty transcation, it's all > > over just the same as if we get ENOSPC during metadata writeback... > > > > Hence what filesystems actually need is device provisioned space to > > be -always over-writable- without ENOSPC occurring. Ideally, if we > > provision a range of the block device, the block device *must* > > guarantee all future writes to that LBA range succeeds. That > > guarantee needs to stand until we discard or unmap the LBA range, > > and for however many writes we do to that LBA range. > > > > e.g. If the device takes a snapshot, it needs to reprovision the > > potential COW ranges that overlap with the provisioned LBA range at > > snapshot time. e.g. by re-reserving the space from the backing pool > > for the provisioned space so if a COW occurs there is space > > guaranteed for it to succeed. If there isn't space in the backing > > pool for the reprovisioning, then whatever operation that triggers > > the COW behaviour should fail with ENOSPC before doing anything > > else.... > > Happy to implement this in dm-thinp. Each thin block will need a bit > to say if the block must be REQ_PROVISION'd at time of snapshot (and > the resulting block will need the same bit set). > > Walking all blocks of a thin device and triggering REQ_PROVISION for > each will obviously make thin snapshot creation take more time. > > I think this approach is better than having a dedicated bitmap hooked > off each thin device's metadata (with bitmap being copied and walked > at the time of snapshot). 
> But we'll see... I'll get with Joe to discuss further.

Hi Mike,

If you recall our most recent discussions on this topic, I was thinking about the prospect of reserving the entire volume at mount time as an initial solution to this problem. When looking through some of the old reservation bits we prototyped years ago, it occurred to me that we have enough mechanism to actually prototype this.

So FYI, I have some hacky prototype code that essentially has the filesystem at mount time tell dm it's using the volume and expects all further writes to succeed. dm-thin acquires a reservation for the entire range of the volume for which writes would require block allocation (i.e., holes and shared dm blocks), or otherwise warns that the fs cannot be "safely" mounted.

The reservation pool associates with the thin volume (not the filesystem), so if a snapshot is requested from dm, the snapshot request locates the snapshot origin and, if it's currently active, increases the reservation pool to account for outstanding blocks that are about to become shared, or otherwise fails the snapshot with -ENOSPC. (I suspect discard needs similar treatment, but I hadn't got to that yet.) If the fs is not active, there is nothing to protect and so the snapshot proceeds as normal.

This seems to work in my simple, initial tests for protecting actively mounted filesystems from dm-thin -ENOSPC. This definitely needs a sanity check from dm-thin folks, however, because I don't know enough about the broader subsystem to reason about whether it's sufficiently correct. I just managed to beat the older prototype code into submission to get it to do what I wanted in simple experiments.

Thoughts on something like this? I think the main advantage is that it significantly reduces the requirements on the fs to track individual allocations.
It's basically an on/off switch from the fs perspective: it doesn't require any explicit provisioning whatsoever (though that can be done to improve things in the future) and in fact could probably be tied to thin volume activation to be made completely filesystem agnostic. Another advantage is that it requires no on-disk changes, no breaking COWs up front during snapshots, etc.

The disadvantages are that it's space inefficient wrt thin pool free space, but IIUC this is essentially what userspace management layers (such as Stratis) are doing today; they just put restrictions up front at volume configuration/creation time instead of at runtime. There also needs to be some kind of interface between the fs and dm. I suppose we could co-opt the provision and discard primitives with a "reservation" modifier flag to get around that in a simple way, but that sounds potentially ugly. TBH, the more I think about this the more I think it makes sense to reserve on volume activation (with some caveats to allow a read-only mode, explicit bypass, etc.) and then let the cross-subsystem interface be dictated by granularity improvements...

... since I also happen to think there is a potentially interesting development path to make this sort of reserve pool configurable in terms of size and active/inactive state, which would allow the fs to use an emergency pool scheme for managing metadata provisioning and not have to track and provision individual metadata buffers at all (dealing with user data is much easier to provision explicitly). So the space inefficiency thing is potentially just a tradeoff for simplicity, and filesystems that want more granularity for better behavior could achieve that with more work. Filesystems that don't would be free to rely on the simple/basic mechanism provided by dm-thin and still have basic -ENOSPC protection with very minimal changes.

That's getting too far into the weeds on the future bits, though.
This is essentially 99% a dm-thin approach, so I'm mainly curious if there's sufficient interest in this sort of "reserve mode" approach to try and clean it up further and have dm guys look at it, or if you guys see any obvious issues in what it does that makes it potentially problematic, or if you would just prefer to go down the path described above... Brian > > Software devices like dm-thin/snapshot should really only need to > > keep a persistent map of the provisioned space and refresh space > > reservations for used space within that map whenever something that > > triggers COW behaviour occurs. i.e. a snapshot needs to reset the > > provisioned ranges back to "all ranges are freshly provisioned" > > before the snapshot is started. If that space is not available in > > the backing pool, then the snapshot attempt gets ENOSPC.... > > > > That means filesystems only need to provision space for journals and > > fixed metadata at mkfs time, and they only need issue a > > REQ_PROVISION bio when they first allocate over-write in place > > metadata. We already have online discard and/or fstrim for releasing > > provisioned space via discards. > > > > This will require some mods to filesystems like ext4 and XFS to > > issue REQ_PROVISION and fail gracefully during metadata allocation. > > However, doing so means that we can actually harden filesystems > > against sparse block device ENOSPC errors by ensuring they will > > never occur in critical filesystem structures.... > > Yes, let's finally _do_ this! ;) > > Mike >
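[Brian's mount-time reservation scheme can also be expressed as a toy model. This is a hypothetical Python sketch of the prototype's accounting, with invented names (`ThinVolume`, `mount`, `snapshot`), not the actual code: mount reserves every block a write could still need to allocate (holes plus shared blocks), and snapshotting an active origin must grow the reservation by the blocks about to become shared, or fail with ENOSPC.]

```python
import errno

class ThinVolume:
    """Toy model of the mount-time 'reserve everything' scheme (hypothetical)."""
    def __init__(self, pool, nr_blocks, allocated, shared):
        self.pool = pool                  # dict with a 'free' block count
        self.nr_blocks = nr_blocks
        self.allocated = set(allocated)   # LBAs with private backing blocks
        self.shared = set(shared)         # LBAs sharing blocks with a snapshot
        self.reserved = 0
        self.active = False

    def mount(self):
        # Reserve every block a future write could need to allocate:
        # holes, plus shared blocks (overwriting those forces a COW).
        holes = self.nr_blocks - len(self.allocated)
        need = holes + len(self.shared)
        if self.pool['free'] < need:
            return -errno.ENOSPC          # cannot be "safely" mounted
        self.pool['free'] -= need
        self.reserved = need
        self.active = True
        return 0

    def snapshot(self):
        if not self.active:
            return 0                      # fs not mounted: nothing to protect
        # Private blocks are about to become shared: grow the reservation
        # so their future COWs are covered, or fail the snapshot.
        extra = len(self.allocated - self.shared)
        if self.pool['free'] < extra:
            return -errno.ENOSPC
        self.pool['free'] -= extra
        self.reserved += extra
        self.shared |= self.allocated
        return 0

pool = {'free': 10}
vol = ThinVolume(pool, nr_blocks=8, allocated={0, 1, 2, 3}, shared={3})
assert vol.mount() == 0                   # 4 holes + 1 shared -> reserve 5
assert pool['free'] == 5
assert vol.snapshot() == 0                # 3 private blocks become shared
assert pool['free'] == 2
```

This makes the space-inefficiency tradeoff visible: the reservation converges on the full logical size of the volume, which is Mike's objection in the following reply.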
On Tue, May 23 2023 at 10:05P -0400, Brian Foster <bfoster@redhat.com> wrote: > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote: > > On Fri, May 19 2023 at 7:07P -0400, > > Dave Chinner <david@fromorbit.com> wrote: > > ... > > > e.g. If the device takes a snapshot, it needs to reprovision the > > > potential COW ranges that overlap with the provisioned LBA range at > > > snapshot time. e.g. by re-reserving the space from the backing pool > > > for the provisioned space so if a COW occurs there is space > > > guaranteed for it to succeed. If there isn't space in the backing > > > pool for the reprovisioning, then whatever operation that triggers > > > the COW behaviour should fail with ENOSPC before doing anything > > > else.... > > > > Happy to implement this in dm-thinp. Each thin block will need a bit > > to say if the block must be REQ_PROVISION'd at time of snapshot (and > > the resulting block will need the same bit set). > > > > Walking all blocks of a thin device and triggering REQ_PROVISION for > > each will obviously make thin snapshot creation take more time. > > > > I think this approach is better than having a dedicated bitmap hooked > > off each thin device's metadata (with bitmap being copied and walked > > at the time of snapshot). But we'll see... I'll get with Joe to > > discuss further. > > > > Hi Mike, > > If you recall our most recent discussions on this topic, I was thinking > about the prospect of reserving the entire volume at mount time as an > initial solution to this problem. When looking through some of the old > reservation bits we prototyped years ago, it occurred to me that we have > enough mechanism to actually prototype this. > > So FYI, I have some hacky prototype code that essentially has the > filesystem at mount time tell dm it's using the volume and expects all > further writes to succeed. 
> dm-thin acquires reservation for the entire
> range of the volume for which writes would require block allocation
> (i.e., holes and shared dm blocks) or otherwise warns that the fs cannot
> be "safely" mounted.
>
> The reservation pool associates with the thin volume (not the
> filesystem), so if a snapshot is requested from dm, the snapshot request
> locates the snapshot origin and if it's currently active, increases the
> reservation pool to account for outstanding blocks that are about to
> become shared, or otherwise fails the snapshot with -ENOSPC. (I suspect
> discard needs similar treatment, but I hadn't got to that yet.). If the
> fs is not active, there is nothing to protect and so the snapshot
> proceeds as normal.
>
> This seems to work on my simple, initial tests for protecting actively
> mounted filesystems from dm-thin -ENOSPC. This definitely needs a sanity
> check from dm-thin folks, however, because I don't know enough about the
> broader subsystem to reason about whether it's sufficiently correct. I
> just managed to beat the older prototype code into submission to get it
> to do what I wanted on simple experiments.

Feel free to share what you have. But my initial gut reaction to the approach is: why even use thin provisioning at all if you're just going to reserve the entire logical address space of each thin device?

> Thoughts on something like this? I think the main advantage is that it
> significantly reduces the requirements on the fs to track individual
> allocations. It's basically an on/off switch from the fs perspective,
> doesn't require any explicit provisioning whatsoever (though it can be
> done to improve things in the future) and in fact could probably be tied
> to thin volume activation to be made completely filesystem agnostic.
> Another advantage is that it requires no on-disk changes, no breaking
> COWs up front during snapshots, etc.

I'm just really unclear on the details without seeing it.
You shared a roll-up of the code we did from years ago so I can kind of imagine the nature of the changes. I'm concerned about snapshots, and the implicit need to compound the reservation for each snapshot. > The disadvantages are that it's space inefficient wrt to thin pool free > space, but IIUC this is essentially what userspace management layers > (such as Stratis) are doing today, they just put restrictions up front > at volume configuration/creation time instead of at runtime. There also > needs to be some kind of interface between the fs and dm. I suppose we > could co-opt provision and discard primitives with a "reservation" > modifier flag to get around that in a simple way, but that sounds > potentially ugly. TBH, the more I think about this the more I think it > makes sense to reserve on volume activation (with some caveats to allow > a read-only mode, explicit bypass, etc.) and then let the > cross-subsystem interface be dictated by granularity improvements... It just feels imprecise to the point of being both excessive and nebulous. thin devices, and snapshots of them, can be active without associated filesystem mounts being active. It just takes a single origin volume to be mounted, with a snapshot active, to force thin blocks' sharing to be broken. > ... since I also happen to think there is a potentially interesting > development path to make this sort of reserve pool configurable in terms > of size and active/inactive state, which would allow the fs to use an > emergency pool scheme for managing metadata provisioning and not have to > track and provision individual metadata buffers at all (dealing with > user data is much easier to provision explicitly). So the space > inefficiency thing is potentially just a tradeoff for simplicity, and > filesystems that want more granularity for better behavior could achieve > that with more work. 
> Filesystems that don't would be free to rely on the
> simple/basic mechanism provided by dm-thin and still have basic -ENOSPC
> protection with very minimal changes.
>
> That's getting too far into the weeds on the future bits, though. This
> is essentially 99% a dm-thin approach, so I'm mainly curious if there's
> sufficient interest in this sort of "reserve mode" approach to try and
> clean it up further and have dm guys look at it, or if you guys see any
> obvious issues in what it does that makes it potentially problematic, or
> if you would just prefer to go down the path described above...

The model that Dave detailed, which builds on REQ_PROVISION and is sticky (by provisioning the same blocks for a snapshot), seems more useful to me because it is quite precise. That said, it doesn't account for hard requirements that _all_ blocks will always succeed.

I'm really not sure we need to go to your extreme (even though Stratis has; the difference is they did so as a crude means to an end, because the existing filesystem code can easily get caught out by -ENOSPC at exactly the wrong time).

Mike

> > > Software devices like dm-thin/snapshot should really only need to
> > > keep a persistent map of the provisioned space and refresh space
> > > reservations for used space within that map whenever something that
> > > triggers COW behaviour occurs. i.e. a snapshot needs to reset the
> > > provisioned ranges back to "all ranges are freshly provisioned"
> > > before the snapshot is started. If that space is not available in
> > > the backing pool, then the snapshot attempt gets ENOSPC....
> > >
> > > That means filesystems only need to provision space for journals and
> > > fixed metadata at mkfs time, and they only need issue a
> > > REQ_PROVISION bio when they first allocate over-write in place
> > > metadata. We already have online discard and/or fstrim for releasing
> > > provisioned space via discards.
> > > > > > This will require some mods to filesystems like ext4 and XFS to > > > issue REQ_PROVISION and fail gracefully during metadata allocation. > > > However, doing so means that we can actually harden filesystems > > > against sparse block device ENOSPC errors by ensuring they will > > > never occur in critical filesystem structures.... > > > > Yes, let's finally _do_ this! ;) > > > > Mike > > > > -- > dm-devel mailing list > dm-devel@redhat.com > https://listman.redhat.com/mailman/listinfo/dm-devel >
On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote: > On Tue, May 23 2023 at 10:05P -0400, Brian Foster <bfoster@redhat.com> wrote: > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote: > > ... since I also happen to think there is a potentially interesting > > development path to make this sort of reserve pool configurable in terms > > of size and active/inactive state, which would allow the fs to use an > > emergency pool scheme for managing metadata provisioning and not have to > > track and provision individual metadata buffers at all (dealing with > > user data is much easier to provision explicitly). So the space > > inefficiency thing is potentially just a tradeoff for simplicity, and > > filesystems that want more granularity for better behavior could achieve > > that with more work. Filesystems that don't would be free to rely on the > > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC > > protection with very minimal changes. > > > > That's getting too far into the weeds on the future bits, though. This > > is essentially 99% a dm-thin approach, so I'm mainly curious if there's > > sufficient interest in this sort of "reserve mode" approach to try and > > clean it up further and have dm guys look at it, or if you guys see any > > obvious issues in what it does that makes it potentially problematic, or > > if you would just prefer to go down the path described above... > > The model that Dave detailed, which builds on REQ_PROVISION and is > sticky (by provisioning same blocks for snapshot) seems more useful to > me because it is quite precise. That said, it doesn't account for > hard requirements that _all_ blocks will always succeed. Hmmm. Maybe I'm misunderstanding the "reserve pool" context here, but I don't think we'd ever need a hard guarantee from the block device that every write bio issued from the filesystem will succeed without ENOSPC. 
If the block device can provide a guarantee that a provisioned LBA range is always writable, then everything else is a filesystem level optimisation problem and we don't have to involve the block device in any way. All we need is a flag we can read out of the bdev at mount time to determine if the filesystem should be operating with LBA provisioning enabled...

e.g. If we need to "pre-provision" a chunk of the LBA space for filesystem metadata, we can do that ahead of time and track the pre-provisioned range(s) in the filesystem itself.

In XFS, that could be as simple as having small chunks of each AG reserved to metadata (e.g. start with the first 100MB) and limiting all metadata allocation free space searches to that specific block range. When we run low on that space, we pre-provision another 100MB chunk and then allocate all metadata out of that new range. If we start getting ENOSPC from pre-provisioning, then we reduce the size of the regions and log low space warnings to userspace. If we can't pre-provision any space at all and we've completely run out, we simply declare ENOSPC for all incoming operations that require metadata allocation until pre-provisioning succeeds again.

This is built entirely on the premise that once proactive backing device provisioning fails, the backing device is at ENOSPC and we have to wait for that situation to go away before allowing new data to be ingested. Hence the block device really doesn't need to know anything about what the filesystem is doing and vice versa - the block dev just says "yes" or "no" and the filesystem handles everything else.

It's worth noting that XFS already has a coarse-grained implementation of preferred regions for metadata storage. It will currently not use those metadata-preferred regions for user data unless all the remaining user data space is full. Hence I'm pretty sure that a pre-provisioning enhancement like this can be done entirely in-memory without requiring any new on-disk state to be added.
Sure, if we crash and remount, then we might choose a different LBA region for pre-provisioning. But that's not really a huge deal, as we could also run an internal background post-mount fstrim operation to remove any unused pre-provisioning that was left over from when the system went down.

Further, managing shared pool exhaustion doesn't require a reservation pool in the backing device and for the filesystems to request space from it. Filesystems already have their own reserve pools via pre-provisioning. If we want the filesystems to be able to release that space back to the shared pool (e.g. because the shared backing pool is critically short on space) then all we need is an extension to FITRIM to tell the filesystem to also release internal pre-provisioned reserves.

Then the backing pool admin (person or automated daemon!) can simply issue a trim on all the filesystems in the pool and space will be returned. Then filesystems will ask for new pre-provisioned space when they next need to ingest modifications, and the backing pool can manage the new pre-provisioning space requests directly....

Hence I think if we get the basic REQ_PROVISION overwrite-in-place guarantees defined and implemented as previously outlined, then we don't need any special coordination between the fs and block devices to avoid fatal ENOSPC issues with sparse and/or snapshot capable block devices...

As a bonus, if we can implement the guarantees in dm-thin/-snapshot and have a filesystem make use of it, then we also have a reference implementation to point at device vendors and standards associations....

Cheers,
Dave.
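[The pre-provisioning scheme Dave sketches above can be modelled with a few lines of toy code. This is a hypothetical Python illustration with invented names (`MetaProvisioner`, `alloc_metadata_block`); the integer decrement stands in for a successful REQ_PROVISION. It shows the three behaviours described: metadata allocation searches only pre-provisioned space, the region size shrinks when the device starts refusing provisioning, and ENOSPC is declared to callers only once nothing at all can be pre-provisioned.]

```python
import errno

class MetaProvisioner:
    """Toy model of in-memory metadata pre-provisioning (hypothetical)."""
    CHUNK = 16                            # blocks per pre-provisioned region

    def __init__(self, bdev_free):
        self.bdev_free = bdev_free        # space the backing device can provision
        self.chunk = self.CHUNK
        self.avail = 0                    # pre-provisioned blocks not yet used

    def _preprovision(self):
        # Try progressively smaller regions; shrink on device ENOSPC
        # (this is where a low-space warning would be logged).
        while self.chunk >= 1:
            if self.bdev_free >= self.chunk:    # stands in for REQ_PROVISION
                self.bdev_free -= self.chunk
                self.avail += self.chunk
                return 0
            self.chunk //= 2              # device is low on space: back off
        return -errno.ENOSPC

    def alloc_metadata_block(self):
        # Metadata allocation only searches pre-provisioned space.
        if self.avail == 0 and self._preprovision() < 0:
            return -errno.ENOSPC          # declare ENOSPC until device recovers
        self.avail -= 1
        return 0

mp = MetaProvisioner(bdev_free=20)
assert mp.alloc_metadata_block() == 0     # provisions a 16-block region
assert mp.avail == 15 and mp.bdev_free == 4
for _ in range(15):
    assert mp.alloc_metadata_block() == 0
assert mp.alloc_metadata_block() == 0     # region shrinks: 16 -> 8 -> 4 fits
assert mp.chunk == 4
```

The block device never learns what the filesystem is doing; it only answers "yes" or "no" to each provisioning request, matching the division of labour argued for above.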
On Tue, May 23 2023 at 8:40P -0400, Dave Chinner <david@fromorbit.com> wrote: > On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote: > > On Tue, May 23 2023 at 10:05P -0400, Brian Foster <bfoster@redhat.com> wrote: > > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote: > > > ... since I also happen to think there is a potentially interesting > > > development path to make this sort of reserve pool configurable in terms > > > of size and active/inactive state, which would allow the fs to use an > > > emergency pool scheme for managing metadata provisioning and not have to > > > track and provision individual metadata buffers at all (dealing with > > > user data is much easier to provision explicitly). So the space > > > inefficiency thing is potentially just a tradeoff for simplicity, and > > > filesystems that want more granularity for better behavior could achieve > > > that with more work. Filesystems that don't would be free to rely on the > > > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC > > > protection with very minimal changes. > > > > > > That's getting too far into the weeds on the future bits, though. This > > > is essentially 99% a dm-thin approach, so I'm mainly curious if there's > > > sufficient interest in this sort of "reserve mode" approach to try and > > > clean it up further and have dm guys look at it, or if you guys see any > > > obvious issues in what it does that makes it potentially problematic, or > > > if you would just prefer to go down the path described above... > > > > The model that Dave detailed, which builds on REQ_PROVISION and is > > sticky (by provisioning same blocks for snapshot) seems more useful to > > me because it is quite precise. That said, it doesn't account for > > hard requirements that _all_ blocks will always succeed. > > Hmmm. 
Maybe I'm misunderstanding the "reserve pool" context here, > but I don't think we'd ever need a hard guarantee from the block > device that every write bio issued from the filesystem will succeed > without ENOSPC. > > If the block device can provide a guarantee that a provisioned LBA > range is always writable, then everything else is a filesystem level > optimisation problem and we don't have to involve the block device > in any way. All we need is a flag we can ready out of the bdev at > mount time to determine if the filesystem should be operating with > LBA provisioning enabled... > > e.g. If we need to "pre-provision" a chunk of the LBA space for > filesystem metadata, we can do that ahead of time and track the > pre-provisioned range(s) in the filesystem itself. > > In XFS, That could be as simple as having small chunks of each AG > reserved to metadata (e.g. start with the first 100MB) and limiting > all metadata allocation free space searches to that specific block > range. When we run low on that space, we pre-provision another 100MB > chunk and then allocate all metadata out of that new range. If we > start getting ENOSPC to pre-provisioning, then we reduce the size of > the regions and log low space warnings to userspace. If we can't > pre-provision any space at all and we've completely run out, we > simply declare ENOSPC for all incoming operations that require > metadata allocation until pre-provisioning succeeds again. This is basically saying the same thing but: It could be that the LBA space is fragmented and so falling back to the smallest region size (that matches the thinp block size) would be the last resort? Then if/when thinp cannot even service allocating a new free thin block, dm-thinp will transition to out-of-data-space mode. 
> This is built entirely on the premise that once proactive backing > device provisioning fails, the backing device is at ENOSPC and we > have to wait for that situation to go away before allowing new data > to be ingested. Hence the block device really doesn't need to know > anything about what the filesystem is doing and vice versa - The > block dev just says "yes" or "no" and the filesystem handles > everything else. Yes. > It's worth noting that XFS already has a coarse-grained > implementation of preferred regions for metadata storage. It will > currently not use those metadata-preferred regions for user data > unless all the remaining user data space is full. Hence I'm pretty > sure that a pre-provisioning enhancment like this can be done > entirely in-memory without requiring any new on-disk state to be > added. > > Sure, if we crash and remount, then we might chose a different LBA > region for pre-provisioning. But that's not really a huge deal as we > could also run an internal background post-mount fstrim operation to > remove any unused pre-provisioning that was left over from when the > system went down. This would be the FITRIM with extension you mention below? Which is a filesystem interface detail? So dm-thinp would _not_ need to have new state that tracks "provisioned but unused" blocks? Nor would the block layer need an extra discard flag for a new class of "provisioned" blocks. If XFS tracked this "provisioned but unused" state, dm-thinp could just discard the block like it's told. Would be nice to avoid dm-thinp needing to track "provisioned but unused". That said, dm-thinp does still need to know if a block was provisioned (given our previous design discussion, to allow proper guarantees from this interface at snapshot time) so that XFS and other filesystems don't need to re-provision areas they already pre-provisioned. 
However, it may be that if thinp did track "provisioned but unused" it'd be useful to allow snapshots to share provisioned blocks that were never used. Meaning, we could then avoid "breaking sharing" at snapshot-time for "provisioned but unused" blocks. But allowing this "optimization" undercuts the guarantee that XFS needs for thinp storage that allows snapshots... SO, I think I answered my own question: thinp doesn't need to track "provisioned but unused" blocks but we must always ensure snapshots inherit provisioned blocks ;) > Further, managing shared pool exhaustion doesn't require a > reservation pool in the backing device and for the filesystems to > request space from it. Filesystems already have their own reserve > pools via pre-provisioning. If we want the filesystems to be able to > release that space back to the shared pool (e.g. because the shared > backing pool is critically short on space) then all we need is an > extension to FITRIM to tell the filesystem to also release internal > pre-provisioned reserves. So by default FITRIM will _not_ discard provisioned blocks. Only if a flag is used will it result in discarding provisioned blocks. My dwelling on this is just double-checking that the > Then the backing pool admin (person or automated daemon!) can simply > issue a trim on all the filesystems in the pool and spce will be > returned. Then filesystems will ask for new pre-provisioned space > when they next need to ingest modifications, and the backing pool > can manage the new pre-provisioning space requests directly.... > > Hence I think if we get the basic REQ_PROVISION overwrite-in-place > guarantees defined and implemented as previously outlined, then we > don't need any special coordination between the fs and block devices > to avoid fatal ENOSPC issues with sparse and/or snapshot capable > block devices... 
> > As a bonus, if we can implement the guarantees in dm-thin/-snapshot > and have a filesystem make use of it, then we also have a reference > implementation to point at device vendors and standards > associations.... Yeap. Thanks, Mike
On Wed, May 24, 2023 at 04:02:49PM -0400, Mike Snitzer wrote: > On Tue, May 23 2023 at 8:40P -0400, > Dave Chinner <david@fromorbit.com> wrote: > > > On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote: > > > On Tue, May 23 2023 at 10:05P -0400, Brian Foster <bfoster@redhat.com> wrote: > > > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote: > > > > ... since I also happen to think there is a potentially interesting > > > > development path to make this sort of reserve pool configurable in terms > > > > of size and active/inactive state, which would allow the fs to use an > > > > emergency pool scheme for managing metadata provisioning and not have to > > > > track and provision individual metadata buffers at all (dealing with > > > > user data is much easier to provision explicitly). So the space > > > > inefficiency thing is potentially just a tradeoff for simplicity, and > > > > filesystems that want more granularity for better behavior could achieve > > > > that with more work. Filesystems that don't would be free to rely on the > > > > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC > > > > protection with very minimal changes. > > > > > > > > That's getting too far into the weeds on the future bits, though. This > > > > is essentially 99% a dm-thin approach, so I'm mainly curious if there's > > > > sufficient interest in this sort of "reserve mode" approach to try and > > > > clean it up further and have dm guys look at it, or if you guys see any > > > > obvious issues in what it does that makes it potentially problematic, or > > > > if you would just prefer to go down the path described above... > > > > > > The model that Dave detailed, which builds on REQ_PROVISION and is > > > sticky (by provisioning same blocks for snapshot) seems more useful to > > > me because it is quite precise. That said, it doesn't account for > > > hard requirements that _all_ blocks will always succeed. > > > > Hmmm. 
Maybe I'm misunderstanding the "reserve pool" context here, > > but I don't think we'd ever need a hard guarantee from the block > > device that every write bio issued from the filesystem will succeed > > without ENOSPC. > > > > If the block device can provide a guarantee that a provisioned LBA > > range is always writable, then everything else is a filesystem level > > optimisation problem and we don't have to involve the block device > > in any way. All we need is a flag we can ready out of the bdev at > > mount time to determine if the filesystem should be operating with > > LBA provisioning enabled... > > > > e.g. If we need to "pre-provision" a chunk of the LBA space for > > filesystem metadata, we can do that ahead of time and track the > > pre-provisioned range(s) in the filesystem itself. > > > > In XFS, That could be as simple as having small chunks of each AG > > reserved to metadata (e.g. start with the first 100MB) and limiting > > all metadata allocation free space searches to that specific block > > range. When we run low on that space, we pre-provision another 100MB > > chunk and then allocate all metadata out of that new range. If we > > start getting ENOSPC to pre-provisioning, then we reduce the size of > > the regions and log low space warnings to userspace. If we can't > > pre-provision any space at all and we've completely run out, we > > simply declare ENOSPC for all incoming operations that require > > metadata allocation until pre-provisioning succeeds again. > > This is basically saying the same thing but: > > It could be that the LBA space is fragmented and so falling back to > the smallest region size (that matches the thinp block size) would be > the last resort? Then if/when thinp cannot even service allocating a > new free thin block, dm-thinp will transition to out-of-data-space > mode. 
Yes, something of that sort, though we'd probably give up if we can't get at least megabyte scale reservations - a single modification in XFS can modify many structures and require allocation of a lot of new metadata, so the filesystem cut-off for metadata provisioning failure would be much larger than the dm-thinp region size.... > > This is built entirely on the premise that once proactive backing > > device provisioning fails, the backing device is at ENOSPC and we > > have to wait for that situation to go away before allowing new data > > to be ingested. Hence the block device really doesn't need to know > > anything about what the filesystem is doing and vice versa - The > > block dev just says "yes" or "no" and the filesystem handles > > everything else. > > Yes. > > > It's worth noting that XFS already has a coarse-grained > > implementation of preferred regions for metadata storage. It will > > currently not use those metadata-preferred regions for user data > > unless all the remaining user data space is full. Hence I'm pretty > > sure that a pre-provisioning enhancment like this can be done > > entirely in-memory without requiring any new on-disk state to be > > added. > > > > Sure, if we crash and remount, then we might chose a different LBA > > region for pre-provisioning. But that's not really a huge deal as we > > could also run an internal background post-mount fstrim operation to > > remove any unused pre-provisioning that was left over from when the > > system went down. > > This would be the FITRIM with extension you mention below? Which is a > filesystem interface detail? No. We might reuse some of the internal infrastructure we use to implement FITRIM, but that's about it. It's just something kinda like FITRIM but with different constraints determined by the filesystem rather than the user... 
As it is, I'm not sure we'd even need it - a periodic userspace FITRIM would achieve the same result, so leaked provisioned spaces would get cleaned up eventually without the filesystem having to do anything specific... > So dm-thinp would _not_ need to have new > state that tracks "provisioned but unused" block? No idea - that's your domain. :) dm-snapshot, for certain, will need to track provisioned regions because it has to guarantee that overwrites to provisioned space in the origin device will always succeed. Hence it needs to know how much space breaking sharing in provisioned regions after a snapshot has been taken will be required... > Nor would the block > layer need an extra discard flag for a new class of "provisioned" > blocks. Right, I don't see that the discard operations need to care whether the underlying storage is provisioned. dm-thinp and dm-snapshot can treat REQ_OP_DISCARD as "this range is no longer in use" and do whatever they want with them. > If XFS tracked this "provisioned but unused" state, dm-thinp could > just discard the block like its told. Would be nice to avoid dm-thinp > needing to track "provisioned but unused". > > That said, dm-thinp does still need to know if a block was provisioned > (given our previous designed discussion, to allow proper guarantees > from this interface at snapshot time) so that XFS and other > filesystems don't need to re-provision areas they already > pre-provisioned. Right. I've simply assumed that dm-thinp would need to track entire provisioned regions - used or unused - so it knows which writes to empty or shared regions have a reservation to allow allocation to succeed when the backing pool is otherwise empty..... > However, it may be that if thinp did track "provisioned but unused" > it'd be useful to allow snapshots to share provisioned blocks that > were never used. Meaning, we could then avoid "breaking sharing" at > snapshot-time for "provisioned but unused" blocks. 
But allowing this > "optimization" undercuts the gaurantee that XFS needs for thinp > storage that allows snapshots... SO, I think I answered my own > question: thinp doesnt need to track "provisioned but unused" blocks > but we must always ensure snapshots inherit provisoned blocks ;) Sounds like a potential optimisation, but I haven't thought through a potential snapshot device implementation that far to comment sanely. I stopped once I got to the point where accounting tricks could be used to guarantee space is available for breaking sharing of used provisioned space after a snapshot was taken.... > > Further, managing shared pool exhaustion doesn't require a > > reservation pool in the backing device and for the filesystems to > > request space from it. Filesystems already have their own reserve > > pools via pre-provisioning. If we want the filesystems to be able to > > release that space back to the shared pool (e.g. because the shared > > backing pool is critically short on space) then all we need is an > > extension to FITRIM to tell the filesystem to also release internal > > pre-provisioned reserves. > > So by default FITRIM will _not_ discard provisioned blocks. Only if > a flag is used will it result in discarding provisioned blocks. No. FITRIM results in discard of any unused free space in the filesystem that matches the criteria set by the user. We don't care if free space was once provisioned used space - we'll issue a discard for the range regardless. The "special" FITRIM extension I mentioned is to get filesystem metadata provisioning released; that's completely separate to user data provisioning through fallocate() which FITRIM will always discard if it has been freed... IOWs, normal behaviour will be that a FITRIM ends up discarding a mix of unprovisioned and provisioned space. Nobody will be able to predict what mix the device is going to get at any point in time. 
Also, if we turn on online discard, the block device is going to get a constant stream of discard operations that will also be a mix of provisioned and unprovisioned space that is no longer in use by the filesystem. I suspect that you need to stop trying to double guess what operations the filesystem will use provisioning for, what it will send discards for and when it will send discards for them. Just assume the device will receive a constant stream of both REQ_PROVISION and REQ_OP_DISCARD (for both provisioned and unprovisioned regions) operations whenever the filesystem is active on a thinp device..... Cheers, Dave.
On Thu, May 25 2023 at 7:39P -0400, Dave Chinner <david@fromorbit.com> wrote: > On Wed, May 24, 2023 at 04:02:49PM -0400, Mike Snitzer wrote: > > On Tue, May 23 2023 at 8:40P -0400, > > Dave Chinner <david@fromorbit.com> wrote: > > > > > On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote: > > > > On Tue, May 23 2023 at 10:05P -0400, Brian Foster <bfoster@redhat.com> wrote: > > > > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote: > > > > > ... since I also happen to think there is a potentially interesting > > > > > development path to make this sort of reserve pool configurable in terms > > > > > of size and active/inactive state, which would allow the fs to use an > > > > > emergency pool scheme for managing metadata provisioning and not have to > > > > > track and provision individual metadata buffers at all (dealing with > > > > > user data is much easier to provision explicitly). So the space > > > > > inefficiency thing is potentially just a tradeoff for simplicity, and > > > > > filesystems that want more granularity for better behavior could achieve > > > > > that with more work. Filesystems that don't would be free to rely on the > > > > > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC > > > > > protection with very minimal changes. > > > > > > > > > > That's getting too far into the weeds on the future bits, though. This > > > > > is essentially 99% a dm-thin approach, so I'm mainly curious if there's > > > > > sufficient interest in this sort of "reserve mode" approach to try and > > > > > clean it up further and have dm guys look at it, or if you guys see any > > > > > obvious issues in what it does that makes it potentially problematic, or > > > > > if you would just prefer to go down the path described above... 
> > > > > > > > The model that Dave detailed, which builds on REQ_PROVISION and is > > > > sticky (by provisioning same blocks for snapshot) seems more useful to > > > > me because it is quite precise. That said, it doesn't account for > > > > hard requirements that _all_ blocks will always succeed. > > > > > > Hmmm. Maybe I'm misunderstanding the "reserve pool" context here, > > > but I don't think we'd ever need a hard guarantee from the block > > > device that every write bio issued from the filesystem will succeed > > > without ENOSPC. > > > > > > If the block device can provide a guarantee that a provisioned LBA > > > range is always writable, then everything else is a filesystem level > > > optimisation problem and we don't have to involve the block device > > > in any way. All we need is a flag we can ready out of the bdev at > > > mount time to determine if the filesystem should be operating with > > > LBA provisioning enabled... > > > > > > e.g. If we need to "pre-provision" a chunk of the LBA space for > > > filesystem metadata, we can do that ahead of time and track the > > > pre-provisioned range(s) in the filesystem itself. > > > > > > In XFS, That could be as simple as having small chunks of each AG > > > reserved to metadata (e.g. start with the first 100MB) and limiting > > > all metadata allocation free space searches to that specific block > > > range. When we run low on that space, we pre-provision another 100MB > > > chunk and then allocate all metadata out of that new range. If we > > > start getting ENOSPC to pre-provisioning, then we reduce the size of > > > the regions and log low space warnings to userspace. If we can't > > > pre-provision any space at all and we've completely run out, we > > > simply declare ENOSPC for all incoming operations that require > > > metadata allocation until pre-provisioning succeeds again. 
> > > > This is basically saying the same thing but: > > > > It could be that the LBA space is fragmented and so falling back to > > the smallest region size (that matches the thinp block size) would be > > the last resort? Then if/when thinp cannot even service allocating a > > new free thin block, dm-thinp will transition to out-of-data-space > > mode. > > Yes, something of that sort, though we'd probably give up if we > can't get at least megabyte scale reservations - a single > modification in XFS can modify many structures and require > allocation of a lot of new metadata, so the fileystem cut-off would > for metadata provisioning failure would be much larger than the > dm-thinp region size.... > > > > This is built entirely on the premise that once proactive backing > > > device provisioning fails, the backing device is at ENOSPC and we > > > have to wait for that situation to go away before allowing new data > > > to be ingested. Hence the block device really doesn't need to know > > > anything about what the filesystem is doing and vice versa - The > > > block dev just says "yes" or "no" and the filesystem handles > > > everything else. > > > > Yes. > > > > > It's worth noting that XFS already has a coarse-grained > > > implementation of preferred regions for metadata storage. It will > > > currently not use those metadata-preferred regions for user data > > > unless all the remaining user data space is full. Hence I'm pretty > > > sure that a pre-provisioning enhancment like this can be done > > > entirely in-memory without requiring any new on-disk state to be > > > added. > > > > > > Sure, if we crash and remount, then we might chose a different LBA > > > region for pre-provisioning. But that's not really a huge deal as we > > > could also run an internal background post-mount fstrim operation to > > > remove any unused pre-provisioning that was left over from when the > > > system went down. 
> > > > This would be the FITRIM with extension you mention below? Which is a > > filesystem interface detail? > > No. We might reuse some of the internal infrastructure we use to > implement FITRIM, but that's about it. It's just something kinda > like FITRIM but with different constraints determined by the > filesystem rather than the user... > > As it is, I'm not sure we'd even need it - a preiodic userspace > FITRIM would acheive the same result, so leaked provisioned spaces > would get cleaned up eventually without the filesystem having to do > anything specific... > > > So dm-thinp would _not_ need to have new > > state that tracks "provisioned but unused" block? > > No idea - that's your domain. :) > > dm-snapshot, for certain, will need to track provisioned regions > because it has to guarantee that overwrites to provisioned space in > the origin device will always succeed. Hence it needs to know how > much space breaking sharing in provisioned regions after a snapshot > has been taken with be required... dm-thinp offers its own much more scalable snapshot support (doesn't use old dm-snapshot N-way copyout target). dm-snapshot isn't going to be modified to support this level of hardening (dm-snapshot is basically in "maintenance only" now). But I understand your meaning: what you said is 100% applicable to dm-thinp's snapshot implementation and needs to be accounted for in thinp's metadata (inherent 'provisioned' flag). > > Nor would the block > > layer need an extra discard flag for a new class of "provisioned" > > blocks. > > Right, I don't see that the discard operations need to care whether > the underlying storage is provisioned. dm-thinp and dm-snapshot can > treat REQ_OP_DISCARD as "this range is not longer in use" and do > whatever they want with them. > > > If XFS tracked this "provisioned but unused" state, dm-thinp could > > just discard the block like its told. Would be nice to avoid dm-thinp > > needing to track "provisioned but unused". 
> > > > That said, dm-thinp does still need to know if a block was provisioned > > (given our previous designed discussion, to allow proper guarantees > > from this interface at snapshot time) so that XFS and other > > filesystems don't need to re-provision areas they already > > pre-provisioned. > > Right. > > I've simply assumed that dm-thinp would need to track entire > provisioned regions - used or unused - so it knows which writes to > empty or shared regions have a reservation to allow allocation to > succeed when the backing pool is otherwise empty..... > > > However, it may be that if thinp did track "provisioned but unused" > > it'd be useful to allow snapshots to share provisioned blocks that > > were never used. Meaning, we could then avoid "breaking sharing" at > > snapshot-time for "provisioned but unused" blocks. But allowing this > > "optimization" undercuts the gaurantee that XFS needs for thinp > > storage that allows snapshots... SO, I think I answered my own > > question: thinp doesnt need to track "provisioned but unused" blocks > > but we must always ensure snapshots inherit provisoned blocks ;) > > Sounds like a potential optimisation, but I haven't thought through > a potential snapshot device implementation that far to comment > sanely. I stopped once I got to the point where accounting tricks > count be used to guarantee space is available for breaking sharing > of used provisioned space after a snapshot was taken.... > > > > Further, managing shared pool exhaustion doesn't require a > > > reservation pool in the backing device and for the filesystems to > > > request space from it. Filesystems already have their own reserve > > > pools via pre-provisioning. If we want the filesystems to be able to > > > release that space back to the shared pool (e.g. 
because the shared > > > backing pool is critically short on space) then all we need is an > > > extension to FITRIM to tell the filesystem to also release internal > > > pre-provisioned reserves. > > > > So by default FITRIM will _not_ discard provisioned blocks. Only if > > a flag is used will it result in discarding provisioned blocks. > > No. FITRIM results in discard of any unused free space in the > filesystem that matches the criteria set by the user. We don't care > if free space was once provisioned used space - we'll issue a > discard for the range regardless. The "special" FITRIM extension I > mentioned is to get filesystem metadata provisioning released; > that's completely separate to user data provisioning through > fallocate() which FITRIM will always discard if it has been freed... > > IOWs, normal behaviour will be that a FITRIM ends up discarding a > mix of unprovisioned and provisioned space. Nobody will be able to > predict what mix the device is going to get at any point in time. > Also, if we turn on online discard, the block device is going to get > a constant stream of discard operations that will also be a mix of > provisioned and unprovisioned space that is not longer in use by the > filesystem. > > I suspect that you need to stop trying to double guess what > operations the filesystem will use provisioning for, what it will > send discards for and when it will send discards for them.. Just > assume the device will receive a constant stream of both > REQ_PROVISION and REQ_OP_DISCARD (for both provisioned and > unprovisioned regions) operations whenver the filesystem is active > on a thinp device..... Yeah, I was getting tripped up in the weeds a bit. It's pretty straight-forward (and like I said at the start of our subthread here: this follow-on work, to inherit provisioned flag, can build on this REQ_PROVISION patchset). All said, I've now gotten this sub-thread on Joe Thornber's radar and we've started discussing. 
We'll be discussing with more focus tomorrow. Mike
On Wed, May 24, 2023 at 10:40:34AM +1000, Dave Chinner wrote: > On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote: > > On Tue, May 23 2023 at 10:05P -0400, Brian Foster <bfoster@redhat.com> wrote: > > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote: > > > ... since I also happen to think there is a potentially interesting > > > development path to make this sort of reserve pool configurable in terms > > > of size and active/inactive state, which would allow the fs to use an > > > emergency pool scheme for managing metadata provisioning and not have to > > > track and provision individual metadata buffers at all (dealing with > > > user data is much easier to provision explicitly). So the space > > > inefficiency thing is potentially just a tradeoff for simplicity, and > > > filesystems that want more granularity for better behavior could achieve > > > that with more work. Filesystems that don't would be free to rely on the > > > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC > > > protection with very minimal changes. > > > > > > That's getting too far into the weeds on the future bits, though. This > > > is essentially 99% a dm-thin approach, so I'm mainly curious if there's > > > sufficient interest in this sort of "reserve mode" approach to try and > > > clean it up further and have dm guys look at it, or if you guys see any > > > obvious issues in what it does that makes it potentially problematic, or > > > if you would just prefer to go down the path described above... > > > > The model that Dave detailed, which builds on REQ_PROVISION and is > > sticky (by provisioning same blocks for snapshot) seems more useful to > > me because it is quite precise. That said, it doesn't account for > > hard requirements that _all_ blocks will always succeed. > > Hmmm. 
Maybe I'm misunderstanding the "reserve pool" context here, > but I don't think we'd ever need a hard guarantee from the block > device that every write bio issued from the filesystem will succeed > without ENOSPC. > The bigger picture goal that I didn't get into in my previous mail is the "full device" reservation model is intended to be a simple, crude reference implementation that can be enabled for any arbitrary thin volume consumer (filesystem or application). The idea is to build that on a simple enough reservation mechanism that any such consumer could override it based on its own operational model. The goal is to guarantee that a particular filesystem never receives -ENOSPC from dm-thin on writes, but the first phase of implementing that is to simply guarantee every block is writeable. As a specific filesystem is able to more explicitly provision its own allocations in a way that it can guarantee to return -ENOSPC from dm-thin up front (rather than at write bio time), it can reduce the need for the amount of reservation required, ultimately to zero if that filesystem provides the ability to pre-provision all of its writes to storage in some way or another. I think for filesystems with complex metadata management like XFS, it's not very realistic to expect explicit 1-1 provisioning for all metadata changes on a per-transaction basis in the same way that can (fairly easily) be done for data, which means a pool mechanism is probably still needed for the metadata class of writes. Therefore, my expectation for something like XFS is that it grows the ability to explicitly provision data writes up front (we solved this part years ago), and then uses a much smaller pool of reservation for the purpose of dealing with metadata. 
I think what you describe below around preprovisioned perag metadata ranges is interesting because it _very closely_ maps conceptually to what I envisioned the evolution of the reserve pool scheme to end up looking like, but just implemented rather differently to use reservations instead of specific LBA ranges. Let me try to connect the dots and identify the differences/tradeoffs... > If the block device can provide a guarantee that a provisioned LBA > range is always writable, then everything else is a filesystem level > optimisation problem and we don't have to involve the block device > in any way. All we need is a flag we can read out of the bdev at > mount time to determine if the filesystem should be operating with > LBA provisioning enabled... > > e.g. If we need to "pre-provision" a chunk of the LBA space for > filesystem metadata, we can do that ahead of time and track the > pre-provisioned range(s) in the filesystem itself. > > In XFS, that could be as simple as having small chunks of each AG > reserved to metadata (e.g. start with the first 100MB) and limiting > all metadata allocation free space searches to that specific block > range. When we run low on that space, we pre-provision another 100MB > chunk and then allocate all metadata out of that new range. If we > start getting ENOSPC to pre-provisioning, then we reduce the size of > the regions and log low space warnings to userspace. If we can't > pre-provision any space at all and we've completely run out, we > simply declare ENOSPC for all incoming operations that require > metadata allocation until pre-provisioning succeeds again. > The more interesting aspect of this is not so much how space is provisioned and allocated, but how the filesystem is going to consume that space in a way that guarantees -ENOSPC is provided up front before userspace is allowed to make modifications.
You didn't really touch on that here, so I'm going to assume we'd have something like a perag counter of how many free blocks currently live in preprovisioned ranges, and then an fs-wide total somewhere so a transaction has the ability to consume these blocks at trans reservation time, the fs knows when to preprovision more space (or go into -ENOSPC mode), etc. Some accounting of that nature is necessary here in order to prevent the filesystem from ever writing to unprovisioned space. So what I was envisioning is rather than explicitly preprovision a physical range of each AG and tracking all that, just reserve that number of arbitrarily located blocks from dm for each AG. The initial perag reservations can be populated at mount time, replenished as needed in a very similar way as what you describe, and 100% released back to the thin pool at unmount time. On top of that, there's no need to track physical preprovisioned ranges at all. Not just for allocation purposes, but also to avoid things like having to protect background trims from preprovisioned ranges of free space dedicated for metadata, etc. > This is built entirely on the premise that once proactive backing > device provisioning fails, the backing device is at ENOSPC and we > have to wait for that situation to go away before allowing new data > to be ingested. Hence the block device really doesn't need to know > anything about what the filesystem is doing and vice versa - The > block dev just says "yes" or "no" and the filesystem handles > everything else. > Yup, everything you describe about going into a simulated -ENOSPC mode would work exactly the same. The primary difference is that the internal provisioned space accounting in the filesystem is backed by dynamic reservation in dm, rather than physically provisioned LBA ranges. > It's worth noting that XFS already has a coarse-grained > implementation of preferred regions for metadata storage. 
It will > currently not use those metadata-preferred regions for user data > unless all the remaining user data space is full. Hence I'm pretty > sure that a pre-provisioning enhancement like this can be done > entirely in-memory without requiring any new on-disk state to be > added. > > Sure, if we crash and remount, then we might choose a different LBA > region for pre-provisioning. But that's not really a huge deal as we > could also run an internal background post-mount fstrim operation to > remove any unused pre-provisioning that was left over from when the > system went down. > None of this is really needed.. > Further, managing shared pool exhaustion doesn't require a > reservation pool in the backing device and for the filesystems to > request space from it. Filesystems already have their own reserve > pools via pre-provisioning. If we want the filesystems to be able to > release that space back to the shared pool (e.g. because the shared > backing pool is critically short on space) then all we need is an > extension to FITRIM to tell the filesystem to also release internal > pre-provisioned reserves. > > Then the backing pool admin (person or automated daemon!) can simply > issue a trim on all the filesystems in the pool and space will be > returned. Then filesystems will ask for new pre-provisioned space > when they next need to ingest modifications, and the backing pool > can manage the new pre-provisioning space requests directly.... > This is written so as to imply that the reservation pool is some big complex thing, which makes me think there is some confusion/miscommunication. It's basically just an in memory counter of space that is allocated out of a shared thin pool and is held in a specific thin volume while it is currently in use. The counter on the volume is managed indirectly by filesystem requests and/or direct operations on the volume (like dm snapshots).
Sure, you could replace the counter and reservation interface with explicitly provisioned/trimmed LBA ranges that the fs can manage to provide -ENOSPC guarantees, but then the fs has to do those various things you've mentioned: - Provision those ranges in the fs and change allocation behavior accordingly. - Do the background post-crash fitrim preprovision clean up thing. - Distinguish between trims that are intended to return preprovisioned space vs. those that come from userspace. - Have some daemon or whatever (?) responsible for communicating the need for trims in the fs to return space back to the pool. Then this still depends on changing how dm thin snapshots work and needs a way to deal with delayed allocation to actually guarantee -ENOSPC protection...? > Hence I think if we get the basic REQ_PROVISION overwrite-in-place > guarantees defined and implemented as previously outlined, then we > don't need any special coordination between the fs and block devices > to avoid fatal ENOSPC issues with sparse and/or snapshot capable > block devices... > This all sounds like a good amount of coordination and unnecessary complexity to me. What I was thinking of as a next-phase approach for a filesystem like XFS (i.e. after the initial full device reservation phase) would be something like this. - Support a mount option for a configurable size metadata reservation pool (with sane/conservative default). - The pool is populated at mount time, else the fs goes right into simulated -ENOSPC mode. - Thin pool reservation consumption is controlled by a flag on write bios that is managed by the fs (flag polarity TBD). - All fs data writes are explicitly reserved up front in the write path. Delalloc maps to explicit reservation, overwrites are easy and just involve an explicit provision. - Metadata writes are not reserved or provisioned at all. They allocate out of the thin pool on write (if needed), just as they do today.
On an -ENOSPC metadata write error, the fs goes into simulated -ENOSPC mode and allows outstanding metadata writes to now use the bio flag to consume emergency reservation. So this means that metadata -ENOSPC protection is only as reliable as the size of the specified pool. This is by design, so the filesystem still does not have to track provisioning, allocation or overwrites of its own metadata usage. Users with metadata heavy workloads or who happen to be sensitive to -ENOSPC errors can be more aggressive with pool size, while other users might be able to get away with a smaller pool. Users who are super paranoid and want perfection can continue to reserve the entire device and pay for the extra storage. Users who are not sure can test their workload in an appropriate environment, collect some data/metrics on maximum outstanding dirty metadata, and then use that as a baseline/minimum pool size for reliable behavior going forward. This is also where something like Stratis can come in to generate this sort of information, make recommendations or implement heuristics (based on things like fs size, amount of RAM, etc.) to provide sane defaults based on use case. I.e., this is initially exposed as a userspace/tuning issue instead of a filesystem/dm-thin hard guarantee. Finally, if you really want to get to that last step of maximally efficient and safe provisioning in the fs, implement a 'thinreserve=adaptive' mode in the fs that alters the acquisition and consumption of dm-thin reserved blocks to be adaptive in nature and promises to do its own usage throttling against outstanding reservation. I think this is the mode that most closely resembles your preprovisioned range mechanism.
For example, adaptive mode could add the logic/complexity where you do the per-ag provision thing (just using reservation instead of physical ranges), change the transaction path to attempt to increase the reservation pool or go into -ENOSPC mode, and flag all writes to be satisfied from the reserve pool (because you've done the provision/reservation up front). At this point the "reserve pool" concept is very different and pretty much managed entirely by the filesystem. It's just a counter of the set of blocks the fs is anticipating to write to in the near term, but it's built on the same underlying reservation mechanism used differently by other filesystems. So something like ext4 could elide the need for an adaptive mode, implement the moderate data/metadata pool mechanism and rely on userspace tools or qualified administrators to do the sizing correctly, while simultaneously using the same underlying mechanism that XFS is using for finer grained provisioning. > As a bonus, if we can implement the guarantees in dm-thin/-snapshot > and have a filesystem make use of it, then we also have a reference > implementation to point at device vendors and standards > associations.... > I think that's a ways ahead yet.. :P Thoughts on any of the above? Does that describe enough of the big picture? (Mike, I hope this at least addresses the whole "why even do this?" question). I am deliberately trying to work through a progression that starts simple and generic but actually 100% solves the problem (even if in a dumb way), then iterates to something that addresses the biggest drawback with the reference implementation with minimal changes required to individual filesystems (i.e. metadata pool sizing), and finally ends up allowing any particular filesystem to refine from there to achieve maximal efficiency based on its own cost/benefit analysis. Another way to look at it is... 
step 1 is to implement the 'thinreserve=full' mount option, which can be trivially implemented by any filesystem with a couple of function calls. Step two is to implement 'thinreserve=N' support, which consists of a standard iomap provisioning implementation for data and a bio tagging/error handling approach that is still pretty simple for most filesystems to implement. Finally, 'thinreserve=adaptive' is the filesystem's best effort to guarantee -ENOSPC safety with maximal space efficiency. One general tradeoff with using reservations vs. preprovisioning is that the latter can just use the provision/trim primitives to alloc/free LBA ranges. My thought on that is those primitives could possibly be modified to do the same sort of things with reservation as for physical allocations. That seems fairly easy to do with bio op flags/modifiers, though one thing I'm not sure about is how to submit a provision bio to request a certain number of location-agnostic blocks. I'd have to investigate that more. So in summary, I can sort of see how this problem could be solved with this combination of physically preprovisioned ranges and changes to dm-thin snapshot behavior and whatnot (I'm still missing how this is supposed to handle delalloc, mostly), but I think that involves more complexity and customization work than is really necessary. Either way, this is a distinctly different approach to what I was thinking of morphing the prototype bits into. So to me the relevant question is does something like the progression that is outlined above for a block reservation approach seem like a reasonable path to ultimately be able to accomplish the same sort of results in the fs? If so, then I'm happy to try and push things in that direction to at least try to prove it out. If not, then I'm also happy to just leave it alone.. ;) Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
On Thu, May 25, 2023 at 9:00 AM Mike Snitzer <snitzer@kernel.org> wrote: > > On Thu, May 25 2023 at 7:39P -0400, > Dave Chinner <david@fromorbit.com> wrote: > > > On Wed, May 24, 2023 at 04:02:49PM -0400, Mike Snitzer wrote: > > > On Tue, May 23 2023 at 8:40P -0400, > > > Dave Chinner <david@fromorbit.com> wrote: > > > > > > > On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote: > > > > > On Tue, May 23 2023 at 10:05P -0400, Brian Foster <bfoster@redhat.com> wrote: > > > > > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote: > > > > > > ... since I also happen to think there is a potentially interesting > > > > > > development path to make this sort of reserve pool configurable in terms > > > > > > of size and active/inactive state, which would allow the fs to use an > > > > > > emergency pool scheme for managing metadata provisioning and not have to > > > > > > track and provision individual metadata buffers at all (dealing with > > > > > > user data is much easier to provision explicitly). So the space > > > > > > inefficiency thing is potentially just a tradeoff for simplicity, and > > > > > > filesystems that want more granularity for better behavior could achieve > > > > > > that with more work. Filesystems that don't would be free to rely on the > > > > > > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC > > > > > > protection with very minimal changes. > > > > > > > > > > > > That's getting too far into the weeds on the future bits, though. This > > > > > > is essentially 99% a dm-thin approach, so I'm mainly curious if there's > > > > > > sufficient interest in this sort of "reserve mode" approach to try and > > > > > > clean it up further and have dm guys look at it, or if you guys see any > > > > > > obvious issues in what it does that makes it potentially problematic, or > > > > > > if you would just prefer to go down the path described above... 
> > > > > > > > > > The model that Dave detailed, which builds on REQ_PROVISION and is > > > > > sticky (by provisioning same blocks for snapshot) seems more useful to > > > > > me because it is quite precise. That said, it doesn't account for > > > > > hard requirements that _all_ blocks will always succeed. > > > > > > > > Hmmm. Maybe I'm misunderstanding the "reserve pool" context here, > > > > but I don't think we'd ever need a hard guarantee from the block > > > > device that every write bio issued from the filesystem will succeed > > > > without ENOSPC. > > > > > > > > If the block device can provide a guarantee that a provisioned LBA > > > > range is always writable, then everything else is a filesystem level > > > > optimisation problem and we don't have to involve the block device > > > > in any way. All we need is a flag we can read out of the bdev at > > > > mount time to determine if the filesystem should be operating with > > > > LBA provisioning enabled... > > > > > > > > e.g. If we need to "pre-provision" a chunk of the LBA space for > > > > filesystem metadata, we can do that ahead of time and track the > > > > pre-provisioned range(s) in the filesystem itself. > > > > > > > > In XFS, that could be as simple as having small chunks of each AG > > > > reserved to metadata (e.g. start with the first 100MB) and limiting > > > > all metadata allocation free space searches to that specific block > > > > range. When we run low on that space, we pre-provision another 100MB > > > > chunk and then allocate all metadata out of that new range. If we > > > > start getting ENOSPC to pre-provisioning, then we reduce the size of > > > > the regions and log low space warnings to userspace. If we can't > > > > pre-provision any space at all and we've completely run out, we > > > > simply declare ENOSPC for all incoming operations that require > > > > metadata allocation until pre-provisioning succeeds again.
> > > > > > This is basically saying the same thing but: > > > > > > It could be that the LBA space is fragmented and so falling back to > > > the smallest region size (that matches the thinp block size) would be > > > the last resort? Then if/when thinp cannot even service allocating a > > > new free thin block, dm-thinp will transition to out-of-data-space > > > mode. > > > > Yes, something of that sort, though we'd probably give up if we > > can't get at least megabyte scale reservations - a single > > modification in XFS can modify many structures and require > > allocation of a lot of new metadata, so the filesystem cut-off > > for metadata provisioning failure would be much larger than the > > dm-thinp region size.... > > > > > > This is built entirely on the premise that once proactive backing > > > > device provisioning fails, the backing device is at ENOSPC and we > > > > have to wait for that situation to go away before allowing new data > > > > to be ingested. Hence the block device really doesn't need to know > > > > anything about what the filesystem is doing and vice versa - The > > > > block dev just says "yes" or "no" and the filesystem handles > > > > everything else. > > > > > > Yes. > > > > > > > It's worth noting that XFS already has a coarse-grained > > > > implementation of preferred regions for metadata storage. It will > > > > currently not use those metadata-preferred regions for user data > > > > unless all the remaining user data space is full. Hence I'm pretty > > > > sure that a pre-provisioning enhancement like this can be done > > > > entirely in-memory without requiring any new on-disk state to be > > > > added. > > > > > > > > Sure, if we crash and remount, then we might choose a different LBA > > > > region for pre-provisioning.
But that's not really a huge deal as we > > > > could also run an internal background post-mount fstrim operation to > > > > remove any unused pre-provisioning that was left over from when the > > > > system went down. > > > > > > This would be the FITRIM with extension you mention below? Which is a > > > filesystem interface detail? > > > > No. We might reuse some of the internal infrastructure we use to > > implement FITRIM, but that's about it. It's just something kinda > > like FITRIM but with different constraints determined by the > > filesystem rather than the user... > > > > As it is, I'm not sure we'd even need it - a periodic userspace > > FITRIM would achieve the same result, so leaked provisioned spaces > > would get cleaned up eventually without the filesystem having to do > > anything specific... > > > > > So dm-thinp would _not_ need to have new > > > state that tracks "provisioned but unused" block? > > > > No idea - that's your domain. :) > > > > dm-snapshot, for certain, will need to track provisioned regions > > because it has to guarantee that overwrites to provisioned space in > > the origin device will always succeed. Hence it needs to know how > > much space breaking sharing in provisioned regions after a snapshot > > has been taken will be required... > > dm-thinp offers its own much more scalable snapshot support (doesn't > use old dm-snapshot N-way copyout target). > > dm-snapshot isn't going to be modified to support this level of > hardening (dm-snapshot is basically in "maintenance only" now). > > But I understand your meaning: what you said is 100% applicable to > dm-thinp's snapshot implementation and needs to be accounted for in > thinp's metadata (inherent 'provisioned' flag). > A bit orthogonal: would dm-thinp need to differentiate between user-triggered provision requests (eg. from fallocate()) vs fs-triggered requests?
I would lean towards user provisioned areas not getting dedup'd on snapshot creation, but that would entail tracking the state of the original request and possibly a provision request flag (REQ_PROVISION_DEDUP_ON_SNAPSHOT) or an inverse flag (REQ_PROVISION_NODEDUP). Possibly too convoluted... > > > Nor would the block > > > layer need an extra discard flag for a new class of "provisioned" > > > blocks. > > > > Right, I don't see that the discard operations need to care whether > > the underlying storage is provisioned. dm-thinp and dm-snapshot can > > treat REQ_OP_DISCARD as "this range is no longer in use" and do > > whatever they want with them. > > > > > If XFS tracked this "provisioned but unused" state, dm-thinp could > > > just discard the block like its told. Would be nice to avoid dm-thinp > > > needing to track "provisioned but unused". > > > > > > That said, dm-thinp does still need to know if a block was provisioned > > > (given our previous designed discussion, to allow proper guarantees > > > from this interface at snapshot time) so that XFS and other > > > filesystems don't need to re-provision areas they already > > > pre-provisioned. > > > > Right. > > > > I've simply assumed that dm-thinp would need to track entire > > provisioned regions - used or unused - so it knows which writes to > > empty or shared regions have a reservation to allow allocation to > > succeed when the backing pool is otherwise empty..... > > > > > However, it may be that if thinp did track "provisioned but unused" > > > it'd be useful to allow snapshots to share provisioned blocks that > > > were never used. Meaning, we could then avoid "breaking sharing" at > > > snapshot-time for "provisioned but unused" blocks. But allowing this > > > "optimization" undercuts the guarantee that XFS needs for thinp > > > storage that allows snapshots...
So, I think I answered my own > > > question: thinp doesn't need to track "provisioned but unused" blocks > > > but we must always ensure snapshots inherit provisioned blocks ;) > > > > Sounds like a potential optimisation, but I haven't thought through > > a potential snapshot device implementation that far to comment > > sanely. I stopped once I got to the point where accounting tricks > > could be used to guarantee space is available for breaking sharing > > of used provisioned space after a snapshot was taken.... > > > > > > Further, managing shared pool exhaustion doesn't require a > > > > reservation pool in the backing device and for the filesystems to > > > > request space from it. Filesystems already have their own reserve > > > > pools via pre-provisioning. If we want the filesystems to be able to > > > > release that space back to the shared pool (e.g. because the shared > > > > backing pool is critically short on space) then all we need is an > > > > extension to FITRIM to tell the filesystem to also release internal > > > > pre-provisioned reserves. > > > > > > So by default FITRIM will _not_ discard provisioned blocks. Only if > > > a flag is used will it result in discarding provisioned blocks. > > > > No. FITRIM results in discard of any unused free space in the > > filesystem that matches the criteria set by the user. We don't care > > if free space was once provisioned used space - we'll issue a > > discard for the range regardless. The "special" FITRIM extension I > > mentioned is to get filesystem metadata provisioning released; > > that's completely separate to user data provisioning through > > fallocate() which FITRIM will always discard if it has been freed... > > > > IOWs, normal behaviour will be that a FITRIM ends up discarding a > > mix of unprovisioned and provisioned space. Nobody will be able to > > predict what mix the device is going to get at any point in time.
> > Also, if we turn on online discard, the block device is going to get > > a constant stream of discard operations that will also be a mix of > > provisioned and unprovisioned space that is no longer in use by the > > filesystem. > > > > I suspect that you need to stop trying to double guess what > > operations the filesystem will use provisioning for, what it will > > send discards for and when it will send discards for them.. Just > > assume the device will receive a constant stream of both > > REQ_PROVISION and REQ_OP_DISCARD (for both provisioned and > > unprovisioned regions) operations whenever the filesystem is active > > on a thinp device..... > > Yeah, I was getting tripped up in the weeds a bit. It's pretty > straight-forward (and like I said at the start of our subthread here: > this follow-on work, to inherit provisioned flag, can build on this > REQ_PROVISION patchset). > > All said, I've now gotten this sub-thread on Joe Thornber's radar and > we've started discussing. We'll be discussing with more focus > tomorrow. > From the perspective of this patch series, I'll wait for more feedback before sending out v8 (which would be the above patches and the follow-on patch to pass through FALLOC_FL_UNSHARE_RANGE [1]). [1] https://listman.redhat.com/archives/dm-devel/2023-May/054188.html Thanks! Sarthak > Mike
On Thu, May 25, 2023 at 03:47:21PM -0700, Sarthak Kukreti wrote: > On Thu, May 25, 2023 at 9:00 AM Mike Snitzer <snitzer@kernel.org> wrote: > > On Thu, May 25 2023 at 7:39P -0400, > > Dave Chinner <david@fromorbit.com> wrote: > > > On Wed, May 24, 2023 at 04:02:49PM -0400, Mike Snitzer wrote: > > > > On Tue, May 23 2023 at 8:40P -0400, > > > > Dave Chinner <david@fromorbit.com> wrote: > > > > > It's worth noting that XFS already has a coarse-grained > > > > > implementation of preferred regions for metadata storage. It will > > > > > currently not use those metadata-preferred regions for user data > > > > > unless all the remaining user data space is full. Hence I'm pretty > > > > > sure that a pre-provisioning enhancement like this can be done > > > > > entirely in-memory without requiring any new on-disk state to be > > > > > added. > > > > > > > > > > Sure, if we crash and remount, then we might choose a different LBA > > > > > region for pre-provisioning. But that's not really a huge deal as we > > > > > could also run an internal background post-mount fstrim operation to > > > > > remove any unused pre-provisioning that was left over from when the > > > > > system went down. > > > > > > > > This would be the FITRIM with extension you mention below? Which is a > > > > filesystem interface detail? > > > > > > No. We might reuse some of the internal infrastructure we use to > > > implement FITRIM, but that's about it. It's just something kinda > > > like FITRIM but with different constraints determined by the > > > filesystem rather than the user... > > > > > > As it is, I'm not sure we'd even need it - a periodic userspace > > > FITRIM would achieve the same result, so leaked provisioned spaces > > > would get cleaned up eventually without the filesystem having to do > > > anything specific... > > > > > > > So dm-thinp would _not_ need to have new > > > > state that tracks "provisioned but unused" block? > > > > > > No idea - that's your domain.
:) > > > > > > dm-snapshot, for certain, will need to track provisioned regions > > > because it has to guarantee that overwrites to provisioned space in > > > the origin device will always succeed. Hence it needs to know how > > > much space breaking sharing in provisioned regions after a snapshot > > > has been taken will be required... > > > > dm-thinp offers its own much more scalable snapshot support (doesn't > > use old dm-snapshot N-way copyout target). > > > > dm-snapshot isn't going to be modified to support this level of > > hardening (dm-snapshot is basically in "maintenance only" now). Ah, of course. Sorry for the confusion, I was kinda using dm-snapshot as shorthand for "dm-thinp + snapshots". > > But I understand your meaning: what you said is 100% applicable to > > dm-thinp's snapshot implementation and needs to be accounted for in > > thinp's metadata (inherent 'provisioned' flag). *nod* > A bit orthogonal: would dm-thinp need to differentiate between > user-triggered provision requests (eg. from fallocate()) vs > fs-triggered requests? Why? How is the guarantee the block device has to provide to provisioned areas different for user vs filesystem internal provisioned space? > I would lean towards user provisioned areas not > getting dedup'd on snapshot creation, <twitch> Snapshotting is a clone operation, not a dedupe operation. Yes, the end result of both is that you have a block shared between multiple indexes that needs COW on the next overwrite, but the two operations that get to that point are very different... </pedantic mode disegaged> > but that would entail tracking > the state of the original request and possibly a provision request > flag (REQ_PROVISION_DEDUP_ON_SNAPSHOT) or an inverse flag > (REQ_PROVISION_NODEDUP). Possibly too convoluted... Let's not try to add everyone's favourite pony to this interface before we've even got it off the ground.
It's the simple precision of the API, the lack of cross-layer communication requirements and the ability to implement and optimise the independent layers independently that makes this a very appealing solution. We need to start with getting the simple stuff working and prove the concept. Then once we can observe the behaviour of a working system we can start working on optimising individual layers for efficiency and performance.... Cheers, Dave.
On Thu, May 25, 2023 at 6:36 PM Dave Chinner <david@fromorbit.com> wrote: > > On Thu, May 25, 2023 at 03:47:21PM -0700, Sarthak Kukreti wrote: > > On Thu, May 25, 2023 at 9:00 AM Mike Snitzer <snitzer@kernel.org> wrote: > > > On Thu, May 25 2023 at 7:39P -0400, > > > Dave Chinner <david@fromorbit.com> wrote: > > > > On Wed, May 24, 2023 at 04:02:49PM -0400, Mike Snitzer wrote: > > > > > On Tue, May 23 2023 at 8:40P -0400, > > > > > Dave Chinner <david@fromorbit.com> wrote: > > > > > > It's worth noting that XFS already has a coarse-grained > > > > > > implementation of preferred regions for metadata storage. It will > > > > > > currently not use those metadata-preferred regions for user data > > > > > > unless all the remaining user data space is full. Hence I'm pretty > > > > > > sure that a pre-provisioning enhancement like this can be done > > > > > > entirely in-memory without requiring any new on-disk state to be > > > > > > added. > > > > > > > > > > > > Sure, if we crash and remount, then we might choose a different LBA > > > > > > region for pre-provisioning. But that's not really a huge deal as we > > > > > > could also run an internal background post-mount fstrim operation to > > > > > > remove any unused pre-provisioning that was left over from when the > > > > > > system went down. > > > > > > > > > > This would be the FITRIM with extension you mention below? Which is a > > > > > filesystem interface detail? > > > > > > > > No. We might reuse some of the internal infrastructure we use to > > > > implement FITRIM, but that's about it. It's just something kinda > > > > like FITRIM but with different constraints determined by the > > > > filesystem rather than the user... > > > > > > > > As it is, I'm not sure we'd even need it - a periodic userspace > > > > FITRIM would achieve the same result, so leaked provisioned spaces > > > > would get cleaned up eventually without the filesystem having to do > > > > anything specific...
> > > > > > > > > So dm-thinp would _not_ need to have new > > > > > state that tracks "provisioned but unused" block? > > > > > > > > No idea - that's your domain. :) > > > > > > > > dm-snapshot, for certain, will need to track provisioned regions > > > > because it has to guarantee that overwrites to provisioned space in > > > > the origin device will always succeed. Hence it needs to know how > > > > much space breaking sharing in provisioned regions after a snapshot > > > > has been taken will be required... > > > dm-thinp offers its own much more scalable snapshot support (doesn't > > > use old dm-snapshot N-way copyout target). > > > > > > dm-snapshot isn't going to be modified to support this level of > > > hardening (dm-snapshot is basically in "maintenance only" now). > > Ah, of course. Sorry for the confusion, I was kinda using > dm-snapshot as shorthand for "dm-thinp + snapshots". > > > > But I understand your meaning: what you said is 100% applicable to > > > dm-thinp's snapshot implementation and needs to be accounted for in > > > thinp's metadata (inherent 'provisioned' flag). > > *nod* > > > A bit orthogonal: would dm-thinp need to differentiate between > > user-triggered provision requests (eg. from fallocate()) vs > > fs-triggered requests? > > Why? How is the guarantee the block device has to provide to > provisioned areas different for user vs filesystem internal > provisioned space? > After thinking this through, I stand corrected. I was primarily concerned with how this would balloon thin snapshot sizes if users potentially provision a large chunk of the filesystem but that's putting the cart way before the horse. Best Sarthak > > I would lean towards user provisioned areas not > > getting dedup'd on snapshot creation, > > <twitch> > > Snapshotting is a clone operation, not a dedupe operation. 
> > Yes, the end result of both is that you have a block shared between > multiple indexes that needs COW on the next overwrite, but the two > operations that get to that point are very different... > > </pedantic mode disengaged> > > > but that would entail tracking > > the state of the original request and possibly a provision request > > flag (REQ_PROVISION_DEDUP_ON_SNAPSHOT) or an inverse flag > > (REQ_PROVISION_NODEDUP). Possibly too convoluted... > > Let's not try to add everyone's favourite pony to this interface > before we've even got it off the ground. > > It's the simple precision of the API, the lack of cross-layer > communication requirements and the ability to implement and optimise > the independent layers independently that makes this a very > appealing solution. > > We need to start with getting the simple stuff working and prove the > concept. Then once we can observe the behaviour of a working system > we can start working on optimising individual layers for efficiency > and performance.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Thu, May 25, 2023 at 12:19:47PM -0400, Brian Foster wrote: > On Wed, May 24, 2023 at 10:40:34AM +1000, Dave Chinner wrote: > > On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote: > > > On Tue, May 23 2023 at 10:05P -0400, Brian Foster <bfoster@redhat.com> wrote: > > > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote: > > > > ... since I also happen to think there is a potentially interesting > > > > development path to make this sort of reserve pool configurable in terms > > > > of size and active/inactive state, which would allow the fs to use an > > > > emergency pool scheme for managing metadata provisioning and not have to > > > > track and provision individual metadata buffers at all (dealing with > > > > user data is much easier to provision explicitly). So the space > > > > inefficiency thing is potentially just a tradeoff for simplicity, and > > > > filesystems that want more granularity for better behavior could achieve > > > > that with more work. Filesystems that don't would be free to rely on the > > > > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC > > > > protection with very minimal changes. > > > > > > > > That's getting too far into the weeds on the future bits, though. This > > > > is essentially 99% a dm-thin approach, so I'm mainly curious if there's > > > > sufficient interest in this sort of "reserve mode" approach to try and > > > > clean it up further and have dm guys look at it, or if you guys see any > > > > obvious issues in what it does that makes it potentially problematic, or > > > > if you would just prefer to go down the path described above... > > > > > > The model that Dave detailed, which builds on REQ_PROVISION and is > > > sticky (by provisioning same blocks for snapshot) seems more useful to > > > me because it is quite precise. That said, it doesn't account for > > > hard requirements that _all_ blocks will always succeed. > > > > Hmmm. 
Maybe I'm misunderstanding the "reserve pool" context here, > > but I don't think we'd ever need a hard guarantee from the block > > device that every write bio issued from the filesystem will succeed > > without ENOSPC. > > > > The bigger picture goal that I didn't get into in my previous mail is > the "full device" reservation model is intended to be a simple, crude > reference implementation that can be enabled for any arbitrary thin > volume consumer (filesystem or application). The idea is to build that > on a simple enough reservation mechanism that any such consumer could > override it based on its own operational model. The goal is to guarantee > that a particular filesystem never receives -ENOSPC from dm-thin on > writes, but the first phase of implementing that is to simply guarantee > every block is writeable. > > As a specific filesystem is able to more explicitly provision its own > allocations in a way that it can guarantee to return -ENOSPC from > dm-thin up front (rather than at write bio time), it can reduce the need > for the amount of reservation required, ultimately to zero if that > filesystem provides the ability to pre-provision all of its writes to > storage in some way or another. > > I think for filesystems with complex metadata management like XFS, it's > not very realistic to expect explicit 1-1 provisioning for all metadata > changes on a per-transaction basis in the same way that can (fairly > easily) be done for data, which means a pool mechanism is probably still > needed for the metadata class of writes. I'm trying to avoid the need for 1-1 provisioning and the need for an accounting-based reservation pool approach. I've tried the reservation pool thing several times, and they've all collapsed under the complexity of behaviour guarantees under worst case write amplification situations. 
The whole point of the LBA provisioning approach is that it completely avoids the need to care about write amplification because the underlying device guarantees any write to a LBA that is provisioned will succeed. It takes care of the write amplification problem for us, and we can make it even easier for the backing device by aligning LBA range provision requests to device region sizes. > > If the block device can provide a guarantee that a provisioned LBA > > range is always writable, then everything else is a filesystem level > > optimisation problem and we don't have to involve the block device > > in any way. All we need is a flag we can read out of the bdev at > > mount time to determine if the filesystem should be operating with > > LBA provisioning enabled... > > > > e.g. If we need to "pre-provision" a chunk of the LBA space for > > filesystem metadata, we can do that ahead of time and track the > > pre-provisioned range(s) in the filesystem itself. > > > > In XFS, that could be as simple as having small chunks of each AG > > reserved to metadata (e.g. start with the first 100MB) and limiting > > all metadata allocation free space searches to that specific block > > range. When we run low on that space, we pre-provision another 100MB > > chunk and then allocate all metadata out of that new range. If we > > start getting ENOSPC to pre-provisioning, then we reduce the size of > > the regions and log low space warnings to userspace. If we can't > > pre-provision any space at all and we've completely run out, we > > simply declare ENOSPC for all incoming operations that require > > metadata allocation until pre-provisioning succeeds again. > > > > The more interesting aspect of this is not so much how space is > provisioned and allocated, but how the filesystem is going to consume > that space in a way that guarantees -ENOSPC is provided up front before > userspace is allowed to make modifications. Yeah, that's trivial with REQ_PROVISION. 
If, at transaction reservation time, we don't have enough provisioned metadata space available for the potential allocations we'll need to make, we kick off provisioning work and wait for more to become available. If that fails and none is available, we'll get an enospc error right there, same as if the filesystem itself has no blocks available for allocation. This is no different to, say, having xfs_create() fail reservation because ENOSPC, then calling xfs_flush_inodes() to kick off an inode cache walk to trim away all the unused post-eof allocations in memory to free up some space we can use. When that completes, we try the reservation again. There's no new behaviours we need to introduce here - it's just replication of existing behaviours and infrastructure. > You didn't really touch on > that here, so I'm going to assume we'd have something like a perag > counter of how many free blocks currently live in preprovisioned ranges, > and then an fs-wide total somewhere so a transaction has the ability to > consume these blocks at trans reservation time, the fs knows when to > preprovision more space (or go into -ENOSPC mode), etc. Sure, something like that. Those are all implementation details, not really that complex to implement, and largely replication of reservation infrastructure we already have. > Some accounting of that nature is necessary here in order to prevent the > filesystem from ever writing to unprovisioned space. So what I was > envisioning is rather than explicitly preprovision a physical range of > each AG and tracking all that, just reserve that number of arbitrarily > located blocks from dm for each AG. > > The initial perag reservations can be populated at mount time, > replenished as needed in a very similar way as what you describe, and > 100% released back to the thin pool at unmount time. On top of that, > there's no need to track physical preprovisioned ranges at all. 
Not just > for allocation purposes, but also to avoid things like having to protect > background trims from preprovisioned ranges of free space dedicated for > metadata, etc. That's all well and good, but reading further down the email the breadth and depth of changes to filesystem and block device behaviour to enable this are ... significant. > > Further, managing shared pool exhaustion doesn't require a > > reservation pool in the backing device and for the filesystems to > > request space from it. Filesystems already have their own reserve > > pools via pre-provisioning. If we want the filesystems to be able to > > release that space back to the shared pool (e.g. because the shared > > backing pool is critically short on space) then all we need is an > > extension to FITRIM to tell the filesystem to also release internal > > pre-provisioned reserves. > > > > Then the backing pool admin (person or automated daemon!) can simply > > issue a trim on all the filesystems in the pool and space will be > > returned. Then filesystems will ask for new pre-provisioned space > > when they next need to ingest modifications, and the backing pool > > can manage the new pre-provisioning space requests directly.... > > > > This is written as to imply that the reservation pool is some big > complex thing, which makes me think there is some > confusion/miscommunication. No confusion, I'm just sceptical that it will work given my experience trying to implement reservation based solutions multiple different ways over the past decade. They've all failed because they collapse under either the complexity explosion or space overhead required to handle the worst case behavioural scenarios. At one point I calculated the worst case reservation needed to ensure log recovery will always succeed, ignoring write amplification, was about 16x the size of the log. 
If I took write amplification for dm-thinp having 64kB blocks and each inode hitting a different cluster in its own dm thinp block, that write amplification hit 64x. So for recovering a 2GB log, if dm-thinp doesn't have a reserve of well over 100GB of pool space, there is no guarantee that log recovery will -always- succeed. It's worst case numbers like this which made me conclude that reservation based approaches cannot provide guarantees that ENOSPC will never occur. The numbers are just too large when you start considering journals that can hold a million dirty objects, intent chains that might require modifying hundreds of metadata blocks across a dozen transactions before they complete, etc. OTOH, REQ_PROVISION makes this "log recovery needs new space to be allocated" problem go away entirely. It provides a mechanism that ensures log recovery does not consume any new space in the backing pool as all the overwrites it performs are to previously provisioned metadata..... This is just one of the many reasons why I think the REQ_PROVISION method is far better than reservations - it solves problems that pure runtime reservations can't. > It's basically just an in memory counter of > space that is allocated out of a shared thin pool and is held in a > specific thin volume while it is currently in use. The counter on the > volume is managed indirectly by filesystem requests and/or direct > operations on the volume (like dm snapshots). > > Sure, you could replace the counter and reservation interface with > explicitly provisioned/trimmed LBA ranges that the fs can manage to > provide -ENOSPC guarantees, but then the fs has to do those various > things you've mentioned: > > - Provision those ranges in the fs and change allocation behavior > accordingly. This is relatively simple - most of the allocator functionality is already there. > - Do the background post-crash fitrim preprovision clean up thing. We've already decided this is not needed. 
> - Distinguish between trims that are intended to return preprovisioned > space vs. those that come from userspace. It's about ten lines of code in xfs_trim_extents() to do this. i.e. the free space tree walk simply skips over free extents in the metadata provisioned region based on a flag value. > - Have some daemon or whatever (?) responsible for communicating the > need for trims in the fs to return space back to the pool. Systems are already configured to run periodic fstrim passes to do this via systemd units. And I'm pretty sure dm-thinp has a low space notification to userspace (via dbus?) that is already used by userspace agents to handle "near ENOSPC" events automatically. > Then this still depends on changing how dm thin snapshots work and needs > a way to deal with delayed allocation to actually guarantee -ENOSPC > protection..? I think you misunderstand: I'm not proposing to use REQ_PROVISION for writes the filesystem does not guarantee will succeed. Never have, I think it makes no sense at all. If the filesystem can return ENOSPC for an unprovisioned user data write, then the block device can too. > > Hence I think if we get the basic REQ_PROVISION overwrite-in-place > > guarantees defined and implemented as previously outlined, then we > > don't need any special coordination between the fs and block devices > > to avoid fatal ENOSPC issues with sparse and/or snapshot capable > > block devices... > > > > This all sounds like a good amount of coordination and unnecessary > complexity to me. What I was thinking as a next phase (i.e. after > initial phase full device reservation) approach for a filesystem like > XFS would be something like this. > > - Support a mount option for a configurable size metadata reservation > pool (with sane/conservative default). I want this all to work without the user having to be aware that their filesystem is running on a sparse device. 
> - The pool is populated at mount time, else the fs goes right into > simulated -ENOSPC mode. What are the rules of this mode? Hmmmm. Log recovery needs to be able to allocate new metadata (i.e. in intent replay), so I'm guessing reservation is needed before log recovery? But if pool reservation fails, how do we then safely perform log recovery given the filesystem is in ENOSPC mode? > - Thin pool reservation consumption is controlled by a flag on write > bios that is managed by the fs (flag polarity TBD). So we still need a bio flag to communicate "this IO consumes reservation". What are the semantics of this flag? What happens on submission error? e.g. the bio is failed before it gets to the layer that consumes it - how does the filesystem know that reservation was consumed or not at completion? How do we know when to set it for user data writes? What happens if the device receives a bio with this flag but there is no reservation remaining? e.g. the filesystem or device accounting have got out of whack? Hmmm. On that note, what about write amplification? Or should I call it "reservation amplification". i.e. a 4kB bio with a "consume reservation" flag might trigger a dm-region COW or allocation and require 512kB of dm-thinp pool space to be allocated. How much reservation actually gets consumed, and how do we reconcile the differences in physical consumption vs reservation consumption? > - All fs data writes are explicitly reserved up front in the write path. > Delalloc maps to explicit reservation, overwrites are easy and just > involve an explicit provision. This is the first you've mentioned an "explicit provision" operation. Is this like REQ_PROVISION, or something else? This seems to imply that the ->iomap_begin method has to do explicit provisioning callouts when we get a write that lands in an IOMAP_MAPPED extent? Or something else? Can you describe this mechanism in more detail? > - Metadata writes are not reserved or provisioned at all. 
They allocate > out of the thin pool on write (if needed), just as they do today. On > an -ENOSPC metadata write error, the fs goes into simulated -ENOSPC mode > and allows outstanding metadata writes to now use the bio flag to > consume emergency reservation. Okay. We need two pools in the backing device? The normal free space pool, and an emergency reservation pool? Without reading further, this implies that the filesystem is reliant on the emergency reservation pool being large enough that it can write any dirty metadata it has outstanding without ENOSPC occurring. How does the size of this emergency pool get configured? > So this means that metadata -ENOSPC protection is only as reliable as > the size of the specified pool. This is by design, so the filesystem > still does not have to track provisioning, allocation or overwrites of > its own metadata usage. Users with metadata heavy workloads or who > happen to be sensitive to -ENOSPC errors can be more aggressive with > pool size, while other users might be able to get away with a smaller > pool. Users who are super paranoid and want perfection can continue to > reserve the entire device and pay for the extra storage. Oh. Hand tuning. :( > Users who are not sure can test their workload in an appropriate > environment, collect some data/metrics on maximum outstanding dirty > metadata, and then use that as a baseline/minimum pool size for reliable > behavior going forward. This is also where something like Stratis can > come in to generate this sort of information, make recommendations or > implement heuristics (based on things like fs size, amount of RAM, for > e.g.) to provide sane defaults based on use case. I.e., this is > initially exposed as a userspace/tuning issue instead of a > filesystem/dm-thin hard guarantee. Which are the same things people have been complaining about for years. 
> Finally, if you really want to get to that last step of maximally > efficient and safe provisioning in the fs, implement a > 'thinreserve=adaptive' mode in the fs that alters the acquisition and > consumption of dm-thin reserved blocks to be adaptive in nature and > promises to do its own usage throttling against outstanding > reservation. I think this is the mode that most closely resembles your > preprovisioned range mechanism. > > For example, adaptive mode could add the logic/complexity where you do > the per-ag provision thing (just using reservation instead of physical > ranges), change the transaction path to attempt to increase the > reservation pool or go into -ENOSPC mode, and flag all writes to be > satisfied from the reserve pool (because you've done the > provision/reservation up front). Ok, so why not just go straight to this model using REQ_PROVISION? If we then want to move to a different "accounting only" model for provisioning, we just change REQ_PROVISION? But I still see the problem of write amplification accounting being unsolved by the "filesystem accounting only" approach advocated here. We have no idea when the backing device has snapshots taken, we have no idea when a filesystem write IO actually consumes more thinp blocks than filesystem blocks, etc. How does the filesystem level reservation pool address these problems? > Thoughts on any of the above? I'd say it went wrong at the requirements stage, resulting in an overly complex, over-engineered solution. > One general tradeoff with using reservations vs. preprovisioning is that > the latter can just use the provision/trim primitives to alloc/free LBA > ranges. My thought on that is those primitives could possibly be > modified to do the same sort of things with reservation as for physical > allocations. That seems fairly easy to do with bio op flags/modifiers, > though one thing I'm not sure about is how to submit a provision bio to > request a certain amount of location-agnostic blocks. 
I'd have to > investigate that more. Sure, if the constrained LBA space aspect of the REQ_PROVISION implementation causes issues, then we see if we can optimise away the fixed LBA space requirement.
On Fri, May 26, 2023 at 07:37:43PM +1000, Dave Chinner wrote: > On Thu, May 25, 2023 at 12:19:47PM -0400, Brian Foster wrote: > > On Wed, May 24, 2023 at 10:40:34AM +1000, Dave Chinner wrote: > > > On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote: > > > > On Tue, May 23 2023 at 10:05P -0400, Brian Foster <bfoster@redhat.com> wrote: > > > > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote: > > > > > ... since I also happen to think there is a potentially interesting > > > > > development path to make this sort of reserve pool configurable in terms > > > > > of size and active/inactive state, which would allow the fs to use an > > > > > emergency pool scheme for managing metadata provisioning and not have to > > > > > track and provision individual metadata buffers at all (dealing with > > > > > user data is much easier to provision explicitly). So the space > > > > > inefficiency thing is potentially just a tradeoff for simplicity, and > > > > > filesystems that want more granularity for better behavior could achieve > > > > > that with more work. Filesystems that don't would be free to rely on the > > > > > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC > > > > > protection with very minimal changes. > > > > > > > > > > That's getting too far into the weeds on the future bits, though. This > > > > > is essentially 99% a dm-thin approach, so I'm mainly curious if there's > > > > > sufficient interest in this sort of "reserve mode" approach to try and > > > > > clean it up further and have dm guys look at it, or if you guys see any > > > > > obvious issues in what it does that makes it potentially problematic, or > > > > > if you would just prefer to go down the path described above... > > > > > > > > The model that Dave detailed, which builds on REQ_PROVISION and is > > > > sticky (by provisioning same blocks for snapshot) seems more useful to > > > > me because it is quite precise. 
That said, it doesn't account for > > > > hard requirements that _all_ blocks will always succeed. > > > > > > Hmmm. Maybe I'm misunderstanding the "reserve pool" context here, > > > but I don't think we'd ever need a hard guarantee from the block > > > device that every write bio issued from the filesystem will succeed > > > without ENOSPC. > > > > > > > The bigger picture goal that I didn't get into in my previous mail is > > the "full device" reservation model is intended to be a simple, crude > > reference implementation that can be enabled for any arbitrary thin > > volume consumer (filesystem or application). The idea is to build that > > on a simple enough reservation mechanism that any such consumer could > > override it based on its own operational model. The goal is to guarantee > > that a particular filesystem never receives -ENOSPC from dm-thin on > > writes, but the first phase of implementing that is to simply guarantee > > every block is writeable. > > > > As a specific filesystem is able to more explicitly provision its own > > allocations in a way that it can guarantee to return -ENOSPC from > > dm-thin up front (rather than at write bio time), it can reduce the need > > for the amount of reservation required, ultimately to zero if that > > filesystem provides the ability to pre-provision all of its writes to > > storage in some way or another. > > > > I think for filesystems with complex metadata management like XFS, it's > > not very realistic to expect explicit 1-1 provisioning for all metadata > > changes on a per-transaction basis in the same way that can (fairly > > easily) be done for data, which means a pool mechanism is probably still > > needed for the metadata class of writes. > > I'm trying to avoid the need for 1-1 provisioning and the need for an > accounting-based reservation pool approach. 
I've tried the > reservation pool thing several times, and they've all collapsed > under the complexity of behaviour guarantees under worst case write > amplification situations. > > The whole point of the LBA provisioning approach is that it > completely avoids the need to care about write amplification because > the underlying device guarantees any write to a LBA that is > provisioned will succeed. It takes care of the write amplification > problem for us, and we can make it even easier for the backing > device by aligning LBA range provision requests to device region > sizes. > > > > If the block device can provide a guarantee that a provisioned LBA > > > range is always writable, then everything else is a filesystem level > > > optimisation problem and we don't have to involve the block device > > > in any way. All we need is a flag we can read out of the bdev at > > > mount time to determine if the filesystem should be operating with > > > LBA provisioning enabled... > > > > > > e.g. If we need to "pre-provision" a chunk of the LBA space for > > > filesystem metadata, we can do that ahead of time and track the > > > pre-provisioned range(s) in the filesystem itself. > > > > > > In XFS, that could be as simple as having small chunks of each AG > > > reserved to metadata (e.g. start with the first 100MB) and limiting > > > all metadata allocation free space searches to that specific block > > > range. When we run low on that space, we pre-provision another 100MB > > > chunk and then allocate all metadata out of that new range. If we > > > start getting ENOSPC to pre-provisioning, then we reduce the size of > > > the regions and log low space warnings to userspace. If we can't > > > pre-provision any space at all and we've completely run out, we > > > simply declare ENOSPC for all incoming operations that require > > > metadata allocation until pre-provisioning succeeds again. 
> > > > > > > The more interesting aspect of this is not so much how space is > > provisioned and allocated, but how the filesystem is going to consume > > that space in a way that guarantees -ENOSPC is provided up front before > > userspace is allowed to make modifications. > > Yeah, that's trivial with REQ_PROVISION. > > If, at transaction reservation time, we don't have enough > provisioned metadata space available for the potential allocations > we'll need to make, we kick off provisioning work and wait for more to > become available. If that fails and none is available, we'll get an > enospc error right there, same as if the filesystem itself has no > blocks available for allocation. > > This is no different to, say, having xfs_create() fail reservation > because ENOSPC, then calling xfs_flush_inodes() to kick off an inode > cache walk to trim away all the unused post-eof allocations in > memory to free up some space we can use. When that completes, > we try the reservation again. > > There's no new behaviours we need to introduce here - it's just > replication of existing behaviours and infrastructure. > Yes, this is just context. What I'm trying to say is the semantics for this aspect would be the same irrespective of "guaranteed writeable space" being implemented as a reservation or preprovisioned LBA range. I.e., it's a limited resource that has to be managed in a way to provide specific user visible behavior. > > You didn't really touch on > > that here, so I'm going to assume we'd have something like a perag > > counter of how many free blocks currently live in preprovisioned ranges, > > and then an fs-wide total somewhere so a transaction has the ability to > > consume these blocks at trans reservation time, the fs knows when to > > preprovision more space (or go into -ENOSPC mode), etc. > > Sure, something like that. 
Those are all implementation details, > not really that complex to implement, and largely replication of > reservation infrastructure we already have. > Ack. > > Some accounting of that nature is necessary here in order to prevent the > > filesystem from ever writing to unprovisioned space. So what I was > > envisioning is rather than explicitly preprovision a physical range of > > each AG and tracking all that, just reserve that number of arbitrarily > > located blocks from dm for each AG. > > > > The initial perag reservations can be populated at mount time, > > replenished as needed in a very similar way as what you describe, and > > 100% released back to the thin pool at unmount time. On top of that, > > there's no need to track physical preprovisioned ranges at all. Not just > > for allocation purposes, but also to avoid things like having to protect > > background trims from preprovisioned ranges of free space dedicated for > > metadata, etc. > > That's all well and good, but reading further down the email the > breadth and depth of changes to filesystem and block device > behaviour to enable this are ... significant. > > > > Further, managing shared pool exhaustion doesn't require a > > > reservation pool in the backing device and for the filesystems to > > > request space from it. Filesystems already have their own reserve > > > pools via pre-provisioning. If we want the filesystems to be able to > > > release that space back to the shared pool (e.g. because the shared > > > backing pool is critically short on space) then all we need is an > > > extension to FITRIM to tell the filesystem to also release internal > > > pre-provisioned reserves. > > > > > > Then the backing pool admin (person or automated daemon!) can simply > > > issue a trim on all the filesystems in the pool and space will be > > > returned. 
Then filesystems will ask for new pre-provisioned space > > > when they next need to ingest modifications, and the backing pool > > > can manage the new pre-provisioning space requests directly.... > > > > > > > This is written as to imply that the reservation pool is some big > > complex thing, which makes me think there is some > > confusion/miscommunication. > > No confusion, I'm just sceptical that it will work given my > experience trying to implement reservation based solutions multiple > different ways over the past decade. They've all failed because > they collapse under either the complexity explosion or space > overhead required to handle the worst case behavioural scenarios. > > At one point I calculated the worst case reservation needed to ensure > log recovery will always succeed, ignoring write amplification, > was about 16x the size of the log. If I took write amplification for > dm-thinp having 64kB blocks and each inode hitting a different > cluster in its own dm thinp block, that write amplification hit 64x. > Ok. Can you give some examples of operations that lead to this worst case behavior? It sounds like you're talking about inode chunk intent initialization or some such, but I'd like to be sure I understand. > So for recovering a 2GB log, if dm-thinp doesn't have a reserve of > well over 100GB of pool space, there is no guarantee that log > recovery will -always- succeed. > > It's worst case numbers like this which made me conclude that > reservation based approaches cannot provide guarantees that ENOSPC > will never occur. The numbers are just too large when you start > considering journals that can hold a million dirty objects, > intent chains that might require modifying hundreds of metadata > blocks across a dozen transactions before they complete, etc. > > OTOH, REQ_PROVISION makes this "log recovery needs new space to be > allocated" problem go away entirely. 
It provides a mechanism that > ensures log recovery does not consume any new space in the backing > pool as all the overwrites it performs are to previously provisioned > metadata..... > Ah, I see. So this relies on the change in behavior of dm-thin snapshots to preserve overwritability of previously provisioned metadata to indirectly manage log recovery -> metadata write amplification. This is useful context and helps understand the intent of that suggestion. I also think it calls out some of the disconnect.. > This is just one of the many reasons why I think the REQ_PROVISION > method is far better than reservations - it solves problems that > pure runtime reservations can't. > > > It's basically just an in memory counter of > > space that is allocated out of a shared thin pool and is held in a > > specific thin volume while it is currently in use. The counter on the > > volume is managed indirectly by filesystem requests and/or direct > > operations on the volume (like dm snapshots). > > > > Sure, you could replace the counter and reservation interface with > > explicitly provisioned/trimmed LBA ranges that the fs can manage to > > provide -ENOSPC guarantees, but then the fs has to do those various > > things you've mentioned: > > > > - Provision those ranges in the fs and change allocation behavior > > accordingly. > > This is relatively simple - most of the allocator functionality is > already there. > > > - Do the background post-crash fitrim preprovision clean up thing. > > We've already decided this is not needed. > > > - Distinguish between trims that are intended to return preprovisioned > > space vs. those that come from userspace. > > It's about ten lines of code in xfs_trim_extents() to do this. i.e. > the free space tree walk simply skips over free extents in the > metadata provisioned region based on a flag value. > > > - Have some daemon or whatever (?) responsible for communicating the > > need for trims in the fs to return space back to the pool. 
> > Systems are already configured to run periodic fstrim passes to do > this via systemd units. And I'm pretty sure dm-thinp has a low space > notification to userspace (via dbus?) that is already used by > userspace agents to handle "near ENOSPC" events automatically. > Yeah, Ok. To be clear, I'm not trying to suggest any of these particular things are complex or the whole thing is intractable or anything like that. I'm pretty sure I understand how this can all be made to work, at least for metadata. But I am saying this makes a bunch of customization changes that could lead to a very XFS centric approach, may be more work than necessary, and hasn't been described in a way that explains how it actually solves (or helps solve) the broader -ENOSPC problem. > > Then this still depends on changing how dm thin snapshots work and needs > > a way to deal with delayed allocation to actually guarantee -ENOSPC > > protection..? > > I think you misunderstand: I'm not proposing to use REQ_PROVISION > for writes the filesystem does not guarantee will succeed. Never > have, I think it makes no sense at all. If the filesystem > can return ENOSPC for an unprovisioned user data write, then the > block device can too. > Well, yes.. that's why I was asking. :) I still don't parse what you're saying here. Is this intended to prevent -ENOSPC from dm-thin for data writes, or is that an exercise for the reader? > > > Hence I think if we get the basic REQ_PROVISION overwrite-in-place > > > guarantees defined and implemented as previously outlined, then we > > > don't need any special coordination between the fs and block devices > > > to avoid fatal ENOSPC issues with sparse and/or snapshot capable > > > block devices... > > > > > > > This all sounds like a good amount of coordination and unnecessary > > complexity to me. What I was thinking as a next phase (i.e. after > > initial phase full device reservation) approach for a filesystem like > > XFS would be something like this. 
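To put rough numbers on the worst-case reservation concern raised earlier in this exchange, here is the arithmetic as plain Python. The 16x and 64x multipliers and the 2GB log size are quoted directly from the thread; they are not derived here, so treat this as illustration rather than a model of XFS log recovery:

```python
# Back-of-envelope version of the worst-case reservation numbers
# from the discussion. Multipliers are taken from the email.

GiB = 1 << 30
log_size = 2 * GiB

no_amp_multiplier = 16   # worst case, ignoring write amplification
amp_multiplier = 64      # with 64kB dm-thinp blocks, per the example

no_amp = no_amp_multiplier * log_size
with_amp = amp_multiplier * log_size

print(no_amp // GiB, "GiB")    # 32 GiB
print(with_amp // GiB, "GiB")  # 128 GiB
```

The 128GiB result is why recovering a 2GB log is said above to need a pool reserve of "well over 100GB".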
> > > > - Support a mount option for a configurable size metadata reservation > > pool (with sane/conservative default). > > I want this all to work without the user having to be aware that > their filesystem is running on a sparse device. > Reservation (or provision) support can certainly be autodetected. > > - The pool is populated at mount time, else the fs goes right into > > simulated -ENOSPC mode. > > What are the rules of this mode? > Earlier you mentioned that the filesystem would "declare -ENOSPC" when preprovisioning starts to fail. The rules here would be exactly the same. > Hmmm. > > Log recovery needs to be able to allocate new metadata (i.e. in > intent replay), so I'm guessing reservation is needed before log > recovery? But if pool reservation fails, how do we then safely > perform log recovery given the filesystem is in ENOSPC mode? > > > - Thin pool reservation consumption is controlled by a flag on write > > bios that is managed by the fs (flag polarity TBD). > > So we still need a bio flag to communicate "this IO consumes > reservation". > > What are the semantics of this flag? What happens on submission > error? e.g. the bio is failed before it gets to the layer that > consumes it - how does the filesystem know that reservation was > consumed or not at completion? > The semantics are to use reservation or not. If so, then it's implied reservation exists and if not that's a bug. If reservation is not enabled, then dm-thin processes those writes exactly as it does today (allocates out of the pool when necessary, fails otherwise). > How do we know when to set it for user data writes? > User data is always reserved before writes are accepted, so reservation is always enabled for user data write bios. > What happens if the device receives a bio with this flag but there > is no reservation remaining? e.g. the filesystem or device > accounting have got out of whack? > That's a bug. 
Almost the same as if the filesystem were to allow a delalloc write that can't ultimately allocate because in-core counters become inconsistent with free space btrees (not like that hasn't happened before ;). Let's not get too into the weeds here as if to imply coding errors translate to design flaws. I'd say it probably should warn, fallback to normal pool allocation. > Hmmm. On that note, what about write amplification? Or should I call > it "reservation amplification". i.e. a 4kB bio with a "consume > reservation" flag might trigger a dm-region COW or allocation and > require 512kB of dm-thinp pool space to be allocated. How much > reservation actually gets consumed, and how do we reconcile the > differences in physical consumption vs reservation consumption? > The filesystem has no concept of the amount of reservation. Only whether outstanding writes have been reserved or not. In this specific reference implementation, all data writes are reserved and metadata writes are not (until an -ENOSPC error happens and the fs decides to "declare -ENOSPC" based on underlying volume state). > > - All fs data writes are explicitly reserved up front in the write path. > > Delalloc maps to explicit reservation, overwrites are easy and just > > involve an explicit provision. > > This is the first you've mentioned an "explicit provision" > operation. Is this like REQ_PROVISION, or something else? > > This seems to imply that the ->iomap_begin method has to do > explicit provisioning callouts when we get a write that lands in an > IOMAP_MAPPED extent? Or something else? > > Can you describe this mechanism in more detail? > So something I've hacked up from the older prototype is to essentially implement a simple form of a REQ_PROVISION|REQ_RESERVE type operation. You can think of it like REQ_PROVISION as implemented by this series, but it doesn't actually do the COW breaking and allocation and whatnot right away. 
Instead, it reserves however many blocks out of the pool might be required to guarantee subsequent writes to the specified region are guaranteed not to fail with -ENOSPC. (Note that the prototype isn't currently using REQ_PROVISION. It's just a function call at the moment. I'm just explaining the concept.) So the idea for user data in general is something like: iomap looks up an extent, does a "reserve provision" over it based on the size of the write, etc. If that succeeds, then the write can proceed to dirty pages with a guarantee that dm-thin will not -ENOSPC at writeback time. If the extent is a hole, then delalloc translates to location agnostic reservation that is eventually translated to a "reserve provision" at filesystem allocation time. Note that this does introduce an aspect of reservation amplification due to block size differences, but this was already addressed by the older prototype. The same 'flush inodes on -ENOSPC' mechanism you refer to above provides a feedback mechanism to allow outstanding reservations to flush and prevent any premature error problems. And that can be optimized further in various ways. For example, to simply map outstanding delalloc extents in the fs and do the "reserve provision" across the ultimate LBA ranges to release overprovisioned reserves, while still deferring writeback to later. A shrinker could be used to allow the thin pool to signal lower space conditions to active volumes to smooth out behavior rather than waiting for an actual -ENOSPC, etc. etc. > > - Metadata writes are not reserved or provisioned at all. They allocate > > out of the thin pool on write (if needed), just as they do today. On > > an -ENOSPC metadata write error, the fs goes into simulated -ENOSPC mode > > and allows outstanding metadata writes to now use the bio flag to > > consume emergency reservation. > > Okay. We need two pools in the backing device? The normal free space > pool, and an emergency reservation pool? > Err not exactly.. 
it's really just selective use of the bio flag that allows reserve consumption. Always enabled on data, never on metadata, until -ENOSPC error and the fs decides to open reserves for metadata and attempt to allow a full quiesce. > Without reading further, this implies that the filesystem is > reliant on the emergency reservation pool being large enough that > it can write any dirty metadata it has outstanding without ENOSPC > occurring. How does the size of this emergency pool get configured? > > > So this means that metadata -ENOSPC protection is only as reliable as > > the size of the specified pool. This is by design, so the filesystem > > still does not have to track provisioning, allocation or overwrites of > > its own metadata usage. Users with metadata heavy workloads or who > > happen to be sensitive to -ENOSPC errors can be more aggressive with > > pool size, while other users might be able to get away with a smaller > > pool. Users who are super paranoid and want perfection can continue to > > reserve the entire device and pay for the extra storage. > > Oh. Hand tuning. :( > Yes and no.. from the fs perspective it's hand tuning. From a user perspective it can be if desired I guess, but really intended to be done by the management software that already exists with intent to help manage this problem. To put it another way, any complex user of thin provisioning who is concerned about this problem is already doing some degree of tuning here to try and prevent it. Also, an alternative to what I describe above could be for a filesystem to implement thinreserve=N mode with a throttle to best ensure -ENOSPC reliability at the cost of performance. It's still a tunable, but maybe easier to turn into a heuristic. Not sure, just a random thought. 
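As a way to reason about the bio-flag semantics described above (flagged writes consume reservation and must not fail, unflagged writes allocate out of the pool as today), here is a toy userspace model. All names and the accounting granularity are invented for illustration and bear no relation to the actual dm-thin code:

```python
# Toy model of "reserve up front, consume at write time" accounting.
# Not kernel code; purely to illustrate the discussed semantics.

class ENOSPC(Exception):
    pass

class ThinPool:
    def __init__(self, free_blocks):
        self.free = free_blocks   # unallocated pool blocks
        self.reserved = 0         # blocks promised but not yet written

    def reserve(self, nblocks):
        """Taken at write()/allocation time; ENOSPC surfaces here, early."""
        if nblocks > self.free - self.reserved:
            raise ENOSPC("reservation failed up front, not at writeback")
        self.reserved += nblocks

    def write_bio(self, nblocks, consumes_reservation):
        """Writeback: flagged bios consume reservation and cannot ENOSPC."""
        if consumes_reservation:
            assert self.reserved >= nblocks, "fs/device accounting bug"
            self.reserved -= nblocks
        elif nblocks > self.free - self.reserved:
            raise ENOSPC("unreserved (e.g. metadata) write lost the race")
        self.free -= nblocks

pool = ThinPool(free_blocks=100)
pool.reserve(10)                                # data: reserved before dirtying pages
pool.write_bio(10, consumes_reservation=True)   # guaranteed to succeed
pool.write_bio(5, consumes_reservation=False)   # metadata: may ENOSPC
```

Note how the "that's a bug" answer above maps to the assert: a flagged bio arriving with no reservation remaining means the two sides' accounting diverged.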
> > Users who are not sure can test their workload in an appropriate > > environment, collect some data/metrics on maximum outstanding dirty > > metadata, and then use that as a baseline/minimum pool size for reliable > > behavior going forward. This is also where something like Stratis can > > come in to generate this sort of information, make recommendations or > > implement heuristics (based on things like fs size, amount of RAM, for > > e.g.) to provide sane defaults based on use case. I.e., this is > > initially exposed as a userspace/tuning issue instead of a > > filesystem/dm-thin hard guarantee. > > Which are the same things people have been complaining about for years. > Priorities and progress. I don't think that because this isn't the absolute perfect, most easily usable, completely efficient solution right off the bat is a very good argument to imply it's not a feasible approach in general. This is why I'm trying to describe the intended progression here. Users are already dealing with this sort of thing through Stratis, and this is a step to at least try to make things better. And really, the same could be said for preprovisioning until/unless it's able to fully guarantee prevention of -ENOSPC errors for data and metadata. That is exactly the same sort of thing "people have been complaining about for years." > > Finally, if you really want to get to that last step of maximally > > efficient and safe provisioning in the fs, implement a > > 'thinreserve=adaptive' mode in the fs that alters the acquisition and > > consumption of dm-thin reserved blocks to be adaptive in nature and > > promises to do it's own usage throttling against outstanding > > reservation. I think this is the mode that most closely resembles your > > preprovisioned range mechanism. 
> > > > For example, adaptive mode could add the logic/complexity where you do > > the per-ag provision thing (just using reservation instead of physical > > ranges), change the transaction path to attempt to increase the > > reservation pool or go into -ENOSPC mode, and flag all writes to be > > satisfied from the reserve pool (because you've done the > > provision/reservation up front). > > Ok, so why not just go straight to this model using REQ_PROVISION? > A couple reasons.. one is I've also been trying to hack around this problem for a while and have yet to see a full solution that can't either be completely broken or just driven to impracticality in terms of performance or space consumption. We had an offline discussion with some of the Stratis and dm folks fairly recently where they explained what Stratis is doing now to mitigate this. Almost everybody agrees this approach stinks, which is why I expected a similar first reaction to my "phase 1 full reservation" model. The thought process there, however, is that rather than continue to try and hack up various invasive solutions to provide such a simple user visible behavior and ultimately not making progress, why not take what they're doing and users are apparently already using and work to make it better? IOW, when thinking about the prospect of "hand tuning" above, I think the more appropriate way to look at it is not that we're providing some full end-user solution right off the bat. Instead, we're taking Stratis' already existing "no overprovision" mode and starting to improve on it. Step one lifts it into the kernel to make it dynamic (runtime reservation vs provision time no overprovision enforcement), next steps focus on making it more efficient while preserving the -ENOSPC safety guarantee. No end user really needs to interact with it directly until/unless filesystems grow the super-smart ability to do everything automagically. 
So it could very well be that this all remains an experimental feature until "adaptive mode" can be made to work, but at least we have the ability to solve the problem incrementally and without permanent changes. The ability to just rip it out without having made any permanent changes to filesystems or thin pool metadata should it just happen to fail spectacularly is also a feature IMO. It could also be the case that the simple sizing mechanism works well enough, and Stratis is able to make good enough recommendations that most users are satisfied, and there's really no need for the levels of complexity we're talking about for adaptive mode (or preprovisioning) at all. > If we then want to move to a different "accounting only" model for > provisioning, we just change REQ_PROVISION? > AFAIU this relies on some permanent changes to dm that are not necessarily trivial to undo..? If so, I think it's actually wiser to move in the opposite direction. If reservation proves too broad and expensive due to things like amplification, then move toward physical provisioning and permanent snapshot changes to address that problem. The benefit of this is that the reservation approach solves the fundamental problem from the start, even if the implementation is ultimately too impractical to be useful, so this mitigates the risk of getting too far down the road with permanent changes to disk formats and whatnot only to find the solution doesn't ultimately work. > But I still see the problem of write amplification accounting being > unsolved by the "filesystem accounting only" approach advocated > here. We have no idea when the backing device has snapshots taken, > we have no idea when a filesystem write IO actually consumes more > thinp blocks than filesystem blocks, etc. How does the filesystem > level reservation pool address these problems? > This is a fair concern in general, but as mentioned above, I think still highlights a misunderstanding with the reserve metadata pool idea. 
The key point I think is that metadata writes are not actually reserved. The reserve pool exists solely to handle the -ENOSPC mode transition, not to continuously supply ongoing metadata transactions. This means write amplification is less of a concern: every successful FSB sized write allocates DMB (device mapper block) sized blocks out of the thin pool, further reducing the odds that subsequent writes to any overlapping FSB blocks will ever require reservation for the current active cycle of the log. A snapshot could happen at any point, but dm-thin snapshots already call into freeze infrastructure, which already quiesces the log. After that point the game starts all over, all overwrites require allocation, and the pool has to be sized large enough to accommodate that the filesystem is able to quiesce in the event of -ENOSPC, particularly if snapshots are in use. So I'm not saying write amplification is not a problem. I think things like crashing a filesystem, doing a snapshot, then running recovery could exacerbate this problem, for example. But that's another corner case that I don't necessarily think discredits the idea. For example, if XFS really wanted to, it could add another pass to log recovery to do "reserve provisions" on affected metadata before recovering anything, or just scan and reserve provision the entire metadata allocated portion of the fs, or refuse to proceed and require full reservation for any time the log is dirty, etc. etc. I think this is something that could use some examples (re: my earlier question) to help work through whether the pool approach is sane, or if the size would just always be too big. 
If not, you could still decide that the configurable pool approach just doesn't work at all for XFS, but track the outstanding metadata block usage somewhere (or estimate based on existing ag buffer btree block counters), reserve that number of blocks at mount time, and use that to guarantee all metadata block overwrites will always succeed in the exact same way a preprovision scheme would. A snapshot while mounted would bump the volume side reservation appropriately or fail. I suspect that trades off mount time performance for better snapshot behavior, but again is just another handwavy option on the table for consideration that doesn't preclude other fs' from possibly doing something more simple. Brian > > Thoughts on any of the above? > > I'd say it went wrong at the requirements stage, resulting in an > overly complex, over-engineered solution. > > > One general tradeoff with using reservations vs. preprovisioning is the > > the latter can just use the provision/trim primitives to alloc/free LBA > > ranges. My thought on that is those primitives could possibly be > > modified to do the same sort of things with reservation as for physical > > allocations. That seems fairly easy to do with bio op flags/modifiers, > > though one thing I'm not sure about is how to submit a provision bio to > > request a certain amount location agnostic blocks. I'd have to > > investigate that more. > > Sure, if the constrained LBA space aspect of the REQ_PROVISION > implementation causes issues, then we see if we can optimise away > the fixed LBA space requirement. > > > -- > Dave Chinner > david@fromorbit.com >
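The FSB vs. DMB size mismatch running through the exchange above is easy to quantify. The sketch below assumes 4kB filesystem blocks and the 64kB dm-thinp blocks from the earlier example (both are illustrative, not fixed constants anywhere), and shows how many device blocks a filesystem-block range can pin in the worst case, which is where "reservation amplification" comes from:

```python
# Worst-case DMB (device-mapper block) consumption for an FSB
# (filesystem block) range. Block sizes are from the thread's example.

FSB = 4 * 1024      # filesystem block size (assumed)
DMB = 64 * 1024     # dm-thinp block size (from the 64kB example)

def dmbs_spanned(fsb_offset, fsb_count):
    """Whole DMBs a contiguous FSB range touches (worst-case COW cost)."""
    start_byte = fsb_offset * FSB
    end_byte = (fsb_offset + fsb_count) * FSB
    return (end_byte + DMB - 1) // DMB - start_byte // DMB

print(dmbs_spanned(0, 1))    # 1: a lone 4kB write can pin a whole 64kB DMB
print(dmbs_spanned(15, 2))   # 2: the range straddles a DMB boundary
```

A single 4kB overwrite after a snapshot can thus cost a full 64kB COW allocation, a 16x per-write amplification at these block sizes.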
On Thu, May 25, 2023 at 07:35:14PM -0700, Sarthak Kukreti wrote: > On Thu, May 25, 2023 at 6:36 PM Dave Chinner <david@fromorbit.com> wrote: > > > > On Thu, May 25, 2023 at 03:47:21PM -0700, Sarthak Kukreti wrote: > > > On Thu, May 25, 2023 at 9:00 AM Mike Snitzer <snitzer@kernel.org> wrote: > > > > On Thu, May 25 2023 at 7:39P -0400, > > > > Dave Chinner <david@fromorbit.com> wrote: > > > > > On Wed, May 24, 2023 at 04:02:49PM -0400, Mike Snitzer wrote: > > > > > > On Tue, May 23 2023 at 8:40P -0400, > > > > > > Dave Chinner <david@fromorbit.com> wrote: > > > > > > > It's worth noting that XFS already has a coarse-grained > > > > > > > implementation of preferred regions for metadata storage. It will > > > > > > > currently not use those metadata-preferred regions for user data > > > > > > > unless all the remaining user data space is full. Hence I'm pretty > > > > > > > sure that a pre-provisioning enhancement like this can be done > > > > > > > entirely in-memory without requiring any new on-disk state to be > > > > > > > added. > > > > > > > > > > > > > > Sure, if we crash and remount, then we might choose a different LBA > > > > > > > region for pre-provisioning. But that's not really a huge deal as we > > > > > > > could also run an internal background post-mount fstrim operation to > > > > > > > remove any unused pre-provisioning that was left over from when the > > > > > > > system went down. > > > > > > > > > > > > This would be the FITRIM with extension you mention below? Which is a > > > > > > filesystem interface detail? > > > > > > > > > > No. We might reuse some of the internal infrastructure we use to > > > > > implement FITRIM, but that's about it. It's just something kinda > > > > > like FITRIM but with different constraints determined by the > > > > > filesystem rather than the user... 
> > > > > > > > > > As it is, I'm not sure we'd even need it - a periodic userspace > > > > > FITRIM would achieve the same result, so leaked provisioned spaces > > > > > would get cleaned up eventually without the filesystem having to do > > > > > anything specific... > > > > > > > > > > > So dm-thinp would _not_ need to have new > > > > > > state that tracks "provisioned but unused" block? > > > > > > > > > > No idea - that's your domain. :) > > > > > > > > > > dm-snapshot, for certain, will need to track provisioned regions > > > > > because it has to guarantee that overwrites to provisioned space in > > > > > the origin device will always succeed. Hence it needs to know how > > > > > much space breaking sharing in provisioned regions after a snapshot > > > > > has been taken will be required... > > > > dm-thinp offers its own much more scalable snapshot support (doesn't > > > > use old dm-snapshot N-way copyout target). > > > > > > > > dm-snapshot isn't going to be modified to support this level of > > > > hardening (dm-snapshot is basically in "maintenance only" now). > > Ah, of course. Sorry for the confusion, I was kinda using > > dm-snapshot as shorthand for "dm-thinp + snapshots". > > > > > > But I understand your meaning: what you said is 100% applicable to > > > > dm-thinp's snapshot implementation and needs to be accounted for in > > > > thinp's metadata (inherent 'provisioned' flag). > > > > *nod* > > > > > A bit orthogonal: would dm-thinp need to differentiate between > > > user-triggered provision requests (eg. from fallocate()) vs > > > fs-triggered requests? > > > > Why? How is the guarantee the block device has to provide to > > provisioned areas different for user vs filesystem internal > > provisioned space? > > > After thinking this through, I stand corrected. 
I was primarily > concerned with how this would balloon thin snapshot sizes if users > potentially provision a large chunk of the filesystem but that's > putting the cart way before the horse. > I think that's a legitimate concern. At some point to provide full -ENOSPC protection the filesystem needs to provision space before it writes to it, whether it be data or metadata, right? At what point does that turn into a case where pretty much everything the fs wrote is provisioned, and therefore a snapshot is just a full copy operation? That might be Ok I guess, but if that's an eventuality then what's the need to track provision state at dm-thin block level? Using some kind of flag you mention below could be a good way to qualify which blocks you'd want to copy vs. which to share on snapshot and perhaps mitigate that problem. > Best > Sarthak > > > > I would lean towards user provisioned areas not > > > getting dedup'd on snapshot creation, > > > > <twitch> > > > > Snapshotting is a clone operation, not a dedupe operation. > > > > Yes, the end result of both is that you have a block shared between > > multiple indexes that needs COW on the next overwrite, but the two > > operations that get to that point are very different... > > > > </pedantic mode disengaged> > > > > > but that would entail tracking > > > the state of the original request and possibly a provision request > > > flag (REQ_PROVISION_DEDUP_ON_SNAPSHOT) or an inverse flag > > > (REQ_PROVISION_NODEDUP). Possibly too convoluted... > > > > Let's not try to add everyone's favourite pony to this interface > > before we've even got it off the ground. > > > > It's the simple precision of the API, the lack of cross-layer > > communication requirements and the ability to implement and optimise > > the independent layers independently that makes this a very > > appealing solution. > > > > We need to start with getting the simple stuff working and prove the > > concept. 
Then once we can observe the behaviour of a working system > > we can start working on optimising individual layers for efficiency > > and performance.... > > I think to prove the concept may not necessarily require changes to dm-thin at all. If you want to guarantee preexisting metadata block writeability, just scan through and provision all metadata blocks at mount time. Hit the log, AG bufs, IIRC XFS already has btree walking code that can be used for btrees and associated metadata, etc. Maybe online scrub has something even better to hook into temporarily for this sort of thing? Mount performance would obviously be bad, but that doesn't matter for the purposes of a prototype. The goal should really be that once mounted, you have established expected writeability invariants and have the ability to test for reliable prevention of -ENOSPC errors from dm-thin from that point forward. If that ultimately works, then refine the ideal implementation from there and ask dm to do whatever writeability tracking and whatnot. FWIW, that may also help deal with things like the fact that xfs_repair can basically relocate the entire set of filesystem metadata to completely different ranges of free space, completely breaking any writeability guarantees tracked by previous provisions of those ranges. Brian > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com >
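For reference, the periodic userspace FITRIM mentioned in this email is typically driven by the fstrim.timer/fstrim.service units shipped with util-linux. A representative timer looks like this (exact contents vary by distribution and util-linux version):

```ini
# fstrim.timer -- runs the companion fstrim.service periodically
[Unit]
Description=Discard unused blocks once a week

[Timer]
OnCalendar=weekly
AccuracySec=1h
Persistent=true

[Install]
WantedBy=timers.target
```

It is enabled with `systemctl enable --now fstrim.timer`; the service then runs fstrim across mounted filesystems, which is the hook where leaked pre-provisioned space would get returned to the pool.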
On Fri, May 26, 2023 at 12:04:02PM +0100, Joe Thornber wrote: > Here's my take: > > I don't see why the filesystem cares if thinp is doing a reservation or > provisioning under the hood. All that matters is that a future write > to that region will be honoured (barring device failure etc.). > > I agree that the reservation/force mapped status needs to be inherited > by snapshots. > > > One of the few strengths of thinp is the performance of taking a snapshot. > Most snapshots created are never activated. Many other snapshots are > only alive for a brief period, and used read-only. eg, blk-archive > (https://github.com/jthornber/blk-archive) uses snapshots to do very > fast incremental backups. As such I'm strongly against any scheme that > requires provisioning as part of the snapshot operation. > > Hank and I are in the middle of the range tree work which requires a > metadata > change. So now is a convenient time to piggyback other metadata changes to > support reservations. > > > Given the above this is what I suggest: > > 1) We have an api (ioctl, bio flag, whatever) that lets you > reserve/guarantee a region: > > int reserve_region(dev, sector_t begin, sector_t end); A C-based interface is not sufficient because the layer that must do provisioning is not guaranteed to be directly under the filesystem. We must be able to propagate the request down to the layers that need to provision storage, and that includes hardware devices. e.g. dm-thin would have to issue REQ_PROVISION on the LBA ranges it allocates in its backing device to guarantee that the provisioned LBA range it allocates is also fully provisioned by the storage below it.... > This api should be used minimally, eg, critical FS metadata only. Keep in mind that "critical FS metadata" in this context is any metadata which could cause the filesystem to hang or enter a global error state if an unexpected ENOSPC error occurs during a metadata write IO. 
Which, in pretty much every journalling filesystem, equates to all metadata in the filesystem. For a typical root filesystem, that might be in the range of 1-200MB (depending on journal size). For larger filesystems with lots of files in them, it will be in the range of GBs of space. Plan for having to support tens of GBs of provisioned space in filesystems, not tens of MBs.... [snip] > Now this is a lot of work. As well as the kernel changes we'll need to > update the userland tools: thin_check, thin_ls, thin_metadata_unpack, > thin_rmap, thin_delta, thin_metadata_pack, thin_repair, thin_trim, > thin_dump, thin_metadata_size, thin_restore. Are we confident that we > have buy in from the FS teams that this will be widely adopted? Are users > asking for this? I really don't want to do 6 months of work for nothing. I think there's 2-3 solid days of coding to fully implement REQ_PROVISION support in XFS, including userspace tool support. Maybe a couple of weeks more to flush the bugs out before it's largely ready to go. So if there's buy in from the block layer and DM people for REQ_PROVISION as described, then I'll definitely have XFS support ready for you to test whenever dm-thinp is ready to go. I can't speak for other filesystems, I suspect the only one we care about is ext4. btrfs and f2fs don't need dm-thinp and there aren't any other filesystems that are used in production on top of dm-thinp, so I think only XFS and ext4 matter at this point in time. I suspect that ext4 would be fairly easy to add support for as well. ext4 has a lot more fixed-place metadata than XFS has so much more of its metadata is covered by mkfs-time provisioning. Limiting dynamic metadata to specific fully provisioned block groups and provisioning new block groups for metadata when they are near full would be equivalent to how I plan to provision metadata space in XFS. Hence the implementation for ext4 looks to be broadly similar in scope and complexity as XFS.... -Dave.
On Tue, May 30 2023 at 3:27P -0400, Joe Thornber <thornber@redhat.com> wrote: > On Sat, May 27, 2023 at 12:45 AM Dave Chinner <david@fromorbit.com> wrote: > > > On Fri, May 26, 2023 at 12:04:02PM +0100, Joe Thornber wrote: > > > > > 1) We have an api (ioctl, bio flag, whatever) that lets you > > > reserve/guarantee a region: > > > > > > int reserve_region(dev, sector_t begin, sector_t end); > > > > A C-based interface is not sufficient because the layer that must do > > provsioning is not guaranteed to be directly under the filesystem. > > We must be able to propagate the request down to the layers that > > need to provision storage, and that includes hardware devices. > > > > e.g. dm-thin would have to issue REQ_PROVISION on the LBA ranges it > > allocates in it's backing device to guarantee that the provisioned > > LBA range it allocates is also fully provisioned by the storage > > below it.... > > > > Fine, bio flag it is. > > > > > > > This api should be used minimally, eg, critical FS metadata only. > > > > > > > > Plan for having to support tens of GBs of provisioned space in > > filesystems, not tens of MBs.... > > > > Also fine. > > > I think there's a 2-3 solid days of coding to fully implement > > REQ_PROVISION support in XFS, including userspace tool support. > > Maybe a couple of weeks more to flush the bugs out before it's > > largely ready to go. > > > > So if there's buy in from the block layer and DM people for > > REQ_PROVISION as described, then I'll definitely have XFS support > > ready for you to test whenever dm-thinp is ready to go. > > > > Great, this is what I wanted to hear. I guess we need an ack from the > block guys and then I'll get started. The block portion is where this discussion started (in the context of this thread's patchset, now at v7). 
During our discussion I think there were 2 gaps identified with this patchset:

1) provisioning should be opt-in, and we need a clear flag that upper layers look for to know if REQ_PROVISION is available
- we do get this with the max_provision_sectors = 0 default; is checking queue_limits (via queue_max_provision_sectors) sufficient for upper layers like xfs?

2) DM thinp needs REQ_PROVISION passdown support
- also dm_table_supports_provision() needs to be stricter by requiring _all_ underlying devices support provisioning?

Bonus dm-thinp work: add ranged REQ_PROVISION support to reduce the number of calls (and bios) block core needs to pass to dm thinp.

Also Joe, for your proposed dm-thinp design where you distinguish between "provision" and "reserve": would it make sense for REQ_META (e.g. all XFS metadata) with REQ_PROVISION to be treated as an LBA-specific hard request? Whereas REQ_PROVISION on its own provides more freedom to just reserve the length of blocks? (e.g. for XFS delalloc where the LBA range is unknown, but dm-thinp can be asked to reserve space to accommodate it).

Mike
On Tue, May 30 2023 at 10:55P -0400, Joe Thornber <thornber@redhat.com> wrote:

> On Tue, May 30, 2023 at 3:02 PM Mike Snitzer <snitzer@kernel.org> wrote:
> >
> > Also Joe, for you proposed dm-thinp design where you distinquish
> > between "provision" and "reserve": Would it make sense for REQ_META
> > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an
> > LBA-specific hard request? Whereas REQ_PROVISION on its own provides
> > more freedom to just reserve the length of blocks? (e.g. for XFS
> > delalloc where LBA range is unknown, but dm-thinp can be asked to
> > reserve space to accomodate it).
> >
> My proposal only involves 'reserve'. Provisioning will be done as part of
> the usual io path.

OK, I think we'd do well to pin down the top-level block interfaces in question. Because this patchset's block interface patch (2/5) header says:

"This patch also adds the capability to call fallocate() in mode 0 on block devices, which will send REQ_OP_PROVISION to the block device for the specified range,"

So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A user of XFS could then use fallocate() for user data -- which would cause thinp's reserve to _not_ be used for critical metadata.

The only way to distinguish the caller (between on-behalf of user data vs XFS metadata) would be REQ_META?

So should dm-thinp have a REQ_META-based distinction? Or just treat all REQ_OP_PROVISION the same?

Mike
On Tue, May 30, 2023 at 8:28 AM Mike Snitzer <snitzer@kernel.org> wrote:
>
> On Tue, May 30 2023 at 10:55P -0400,
> Joe Thornber <thornber@redhat.com> wrote:
>
> > On Tue, May 30, 2023 at 3:02 PM Mike Snitzer <snitzer@kernel.org> wrote:
> > >
> > > Also Joe, for you proposed dm-thinp design where you distinquish
> > > between "provision" and "reserve": Would it make sense for REQ_META
> > > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an
> > > LBA-specific hard request? Whereas REQ_PROVISION on its own provides
> > > more freedom to just reserve the length of blocks? (e.g. for XFS
> > > delalloc where LBA range is unknown, but dm-thinp can be asked to
> > > reserve space to accomodate it).
> > >
> > My proposal only involves 'reserve'. Provisioning will be done as part of
> > the usual io path.
>
> OK, I think we'd do well to pin down the top-level block interfaces in
> question. Because this patchset's block interface patch (2/5) header
> says:
>
> "This patch also adds the capability to call fallocate() in mode 0
> on block devices, which will send REQ_OP_PROVISION to the block
> device for the specified range,"
>
> So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A
> user of XFS could then use fallocate() for user data -- which would
> cause thinp's reserve to _not_ be used for critical metadata.
>
> The only way to distinquish the caller (between on-behalf of user data
> vs XFS metadata) would be REQ_META?
>
> So should dm-thinp have a REQ_META-based distinction? Or just treat
> all REQ_OP_PROVISION the same?
>
I'm in favor of a REQ_META-based distinction. Does that imply that REQ_META also needs to be passed through the block/filesystem stack (eg. REQ_OP_PROVISION + REQ_META on a loop device translates to a fallocate(<insert meta flag name>) to the underlying file)?
<bikeshed>
I think that might have applications beyond just provisioning: currently, for stacked filesystems (eg filesystems residing in a file on top of another filesystem), even if the upper filesystem issues read/write requests with REQ_META | REQ_PRIO, these flags are lost in translation at the loop device layer. A flag like the above would allow the prioritization of stacked filesystem metadata requests.
</bikeshed>

Bringing the discussion back to this series for a bit: I'm still waiting on feedback from the Block maintainers before sending out v8 (which, at the moment, only has a s/EXPORT_SYMBOL/EXPORT_SYMBOL_GPL/g). I believe from the conversation most of the above is follow-up work, but please let me know if you'd prefer I add some of this to the current series!

Best
Sarthak

> Mike
On Fri, Jun 02 2023 at 2:44P -0400, Sarthak Kukreti <sarthakkukreti@chromium.org> wrote: > On Tue, May 30, 2023 at 8:28 AM Mike Snitzer <snitzer@kernel.org> wrote: > > > > On Tue, May 30 2023 at 10:55P -0400, > > Joe Thornber <thornber@redhat.com> wrote: > > > > > On Tue, May 30, 2023 at 3:02 PM Mike Snitzer <snitzer@kernel.org> wrote: > > > > > > > > > > > Also Joe, for you proposed dm-thinp design where you distinquish > > > > between "provision" and "reserve": Would it make sense for REQ_META > > > > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an > > > > LBA-specific hard request? Whereas REQ_PROVISION on its own provides > > > > more freedom to just reserve the length of blocks? (e.g. for XFS > > > > delalloc where LBA range is unknown, but dm-thinp can be asked to > > > > reserve space to accomodate it). > > > > > > > > > > My proposal only involves 'reserve'. Provisioning will be done as part of > > > the usual io path. > > > > OK, I think we'd do well to pin down the top-level block interfaces in > > question. Because this patchset's block interface patch (2/5) header > > says: > > > > "This patch also adds the capability to call fallocate() in mode 0 > > on block devices, which will send REQ_OP_PROVISION to the block > > device for the specified range," > > > > So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A > > user of XFS could then use fallocate() for user data -- which would > > cause thinp's reserve to _not_ be used for critical metadata. > > > > The only way to distinquish the caller (between on-behalf of user data > > vs XFS metadata) would be REQ_META? > > > > So should dm-thinp have a REQ_META-based distinction? Or just treat > > all REQ_OP_PROVISION the same? > > > I'm in favor of a REQ_META-based distinction. Does that imply that > REQ_META also needs to be passed through the block/filesystem stack > (eg. 
REQ_OP_PROVION + REQ_META on a loop device translates to a > fallocate(<insert meta flag name>) to the underlying file)?

Unclear, I was thinking your REQ_UNSHARE (tied to fallocate) might be a means to translate REQ_OP_PROVISION + REQ_META to fallocate and have it perform the LBA-specific provisioning of Joe's design (referenced below).

> <bikeshed>
> I think that might have applications beyond just provisioning:
> currently, for stacked filesystems (eg filesystems residing in a file
> on top of another filesystem), even if the upper filesystem issues
> read/write requests with REQ_META | REQ_PRIO, these flags are lost in
> translation at the loop device layer. A flag like the above would
> allow the prioritization of stacked filesystem metadata requests.
> </bikeshed>

Yes, it could prove useful.

> Bringing the discussion back to this series for a bit, I'm still
> waiting on feedback from the Block maintainers before sending out v8
> (which at the moment, only have a
> s/EXPORT_SYMBOL/EXPORT_SYMBOL_GPL/g). I believe from the conversation
> most of the above is follow up work, but please let me know if you'd
> prefer I add some of this to the current series!

I need a bit more time to work through various aspects of the broader requirements and the resulting interfaces that fall out. Joe's design is pretty compelling because it will properly handle snapshot thin devices: https://listman.redhat.com/archives/dm-devel/2023-May/054351.html

Here is my latest status:

- Focused on prototype for thinp block reservation (XFS metadata, XFS delalloc, fallocate)
- Decided the "dynamic" (non-LBA specific) reservation stuff (old prototype code) is best left independent from Joe's design. So 2 classes of thinp reservation.
- Forward-ported the old prototype code that Brian Foster, Joe Thornber and I worked on years ago. It needs more careful review (and very likely will need fixes from Brian and myself).
The XFS changes are pretty intrusive and likely up for serious debate (as to whether we even care to handle reservations for user data).

- REQ_OP_PROVISION bio’s with REQ_META will use Joe’s design, otherwise data (XFS data and fallocate) will use “dynamic” reservation.
- The "dynamic" name is due to the reservation being generic (non-LBA: not in terms of an LBA range). Also, in-core only; so the associated “dynamic_reserve_count” accounting is reset to 0 on every activation.
- Fallocate may require stronger guarantees in the end (in which case we’ll add a REQ_UNSHARE flag that is selectable from the fallocate interface).
- Will try to share common code, but just sorting out high-level interface(s) still...

I'll try to get a git tree together early next week. It will be the forward-ported "dynamic" prototype code and your latest v7 code with some additional work to branch accordingly for each class of thinp reservation. And I'll use your v7 code as a crude stub for Joe's approach (branch taken if REQ_META set).

Lastly, here are some additional TODOs I've noted in code earlier in my review process:

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 0d9301802609..43a6702f9efe 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1964,6 +1964,26 @@ static void process_provision_bio(struct thin_c *tc, struct bio *bio)
 	struct dm_cell_key key;
 	struct dm_thin_lookup_result lookup_result;
 
+	/*
+	 * FIXME:
+	 * Joe's elegant reservation design is detailed here:
+	 * https://listman.redhat.com/archives/dm-devel/2023-May/054351.html
+	 * - this design, with associated thinp metadata updates,
+	 *   is how provision bios should be handled.
+	 *
+	 * FIXME: add thin-pool flag "ignore_provision"
+	 *
+	 * FIXME: needs provision_passdown support
+	 *        (needs thinp flag "no_provision_passdown")
+	 */
+
+	/*
+	 * FIXME: require REQ_META (or REQ_UNSHARE?) to allow deeper
+	 * provisioning code that follows? (so that thinp
+	 * block _is_ fully provisioned upon return)
+	 * (or just remove all below code entirely?)
+	 */
+
 	/*
 	 * If cell is already occupied, then the block is already
 	 * being provisioned so we have nothing further to do here.
On Fri, Jun 02, 2023 at 11:44:27AM -0700, Sarthak Kukreti wrote: > On Tue, May 30, 2023 at 8:28 AM Mike Snitzer <snitzer@kernel.org> wrote: > > > > On Tue, May 30 2023 at 10:55P -0400, > > Joe Thornber <thornber@redhat.com> wrote: > > > > > On Tue, May 30, 2023 at 3:02 PM Mike Snitzer <snitzer@kernel.org> wrote: > > > > > > > > > > > Also Joe, for you proposed dm-thinp design where you distinquish > > > > between "provision" and "reserve": Would it make sense for REQ_META > > > > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an > > > > LBA-specific hard request? Whereas REQ_PROVISION on its own provides > > > > more freedom to just reserve the length of blocks? (e.g. for XFS > > > > delalloc where LBA range is unknown, but dm-thinp can be asked to > > > > reserve space to accomodate it). > > > > > > > > > > My proposal only involves 'reserve'. Provisioning will be done as part of > > > the usual io path. > > > > OK, I think we'd do well to pin down the top-level block interfaces in > > question. Because this patchset's block interface patch (2/5) header > > says: > > > > "This patch also adds the capability to call fallocate() in mode 0 > > on block devices, which will send REQ_OP_PROVISION to the block > > device for the specified range," > > > > So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A > > user of XFS could then use fallocate() for user data -- which would > > cause thinp's reserve to _not_ be used for critical metadata. Mike, I think you might have misunderstood what I have been proposing. Possibly unintentionally, I didn't call it REQ_OP_PROVISION but that's what I intended - the operation does not contain data at all. It's an operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROS - it contains a range of sectors that need to be provisioned (or discarded), and nothing else. The write IOs themselves are not tagged with anything special at all. i.e. 
The proposal I made does not use REQ_PROVISION anywhere in the metadata/data IO path; provisioned regions are created by separate operations and must be tracked by the underlying block device, then treat any write IO to those regions as "must not fail w/ ENOSPC" IOs. There seems to be a lot of fear about user data requiring provisioning. This is unfounded - provisioning is only needed for explicitly provisioned space via fallocate(), not every byte of user data written to the filesystem (the model Brian is proposing). Excessive use of fallocate() is self correcting - if users and/or their applications provision too much, they are going to get ENOSPC or have to pay more to expand the backing pool reserves they need. But that's not a problem the block device should be trying to solve; that's a problem for the sysadmin and/or bean counters to address. > > > > The only way to distinquish the caller (between on-behalf of user data > > vs XFS metadata) would be REQ_META? > > > > So should dm-thinp have a REQ_META-based distinction? Or just treat > > all REQ_OP_PROVISION the same? > > > I'm in favor of a REQ_META-based distinction. Why? What *requirement* is driving the need for this distinction? As the person who proposed this new REQ_OP_PROVISION architecture, I'm dead set against it. Allowing the block device provide a set of poorly defined "conditional guarantees" policies instead of a mechanism with a single ironclad guarantee defeats the entire purpose of the proposal. We have a requirement from the *kernel ABI* that *user data writes* must not fail with ENOSPC after an fallocate() operation. That's one of the high level policies we need to implement. The filesystem is already capable of guaranteeing it won't give the user ENOSPC after fallocate, we now need a guarantee from the filesystem's backing store that it won't give ENOSPC, too. 
The _other thing_ we need to implement is a method of guaranteeing the filesystem won't shut down when the backing device goes ENOSPC unexpected during metadata writeback. So we also need the backing device to guarantee the regions we write metadata to won't give ENOSPC. That's the whole point of REQ_OP_PROVISION: from the layers above the block device, there is -zero- difference between the guarantee we need for user data writes to avoid ENOSPC and for metadata writes to avoid ENOSPC. They are one and the same. Hence if the block device is going to say "I support provisioning" but then give different conditional guarantees according to the *type of data* in the IO request, then it does not provide the functionality the higher layers actually require from it. Indeed, what type of data the IO contains is *context dependent*. For example, sometimes we write metadata with user data IO and but we still need provisioning guarantees as if it was issued as metadata IO. This is the case for mkfs initialising the file system by writing directly to the block device. IOWs, filesystem metadata IO issued from kernel context would be considered metadata IO, but from userspace it would be considered normal user data IO and hence treated differently. But the reality is that they both need the same provisioning guarantees to be provided by the block device. So how do userspace tools deal with this if the block device requires REQ_META on user data IOs to do the right thing here? And if we provide a mechanism to allow this, how do we prevent userspace for always using it on writes to fallocate() provisioned space? It's just not practical for the block device to add arbitrary constraints based on the type of IO because we then have to add mechanisms to userspace APIs to allow them to control the IO context so the block device will do the right thing. 
Especially considering we really only need one type of guarantee regardless of where the IO originates from or what type of data the IO contains....

> Does that imply that
> REQ_META also needs to be passed through the block/filesystem stack
> (eg. REQ_OP_PROVION + REQ_META on a loop device translates to a
> fallocate(<insert meta flag name>) to the underlying file)?

This is exactly the same case as above: the loopback device does user data IO to the backing file. Hence we have another situation where metadata IO is issued to fallocate()d user data ranges as user data ranges and so would be given a lesser guarantee that would lead to upper filesystem failure. Both upper and lower filesystem data and metadata need to be provided the same ENOSPC guarantees by their backing stores....

The whole point of the REQ_OP_PROVISION proposal I made is that it doesn't require any special handling in corner cases like this. There are no cross-layer interactions needed to make everything work correctly because the provisioning guarantee is not *data type dependent*. The entire user IO path code remains untouched and blissfully unaware of provisioned regions.

And, realistically, if we have to start handling complex corner cases in the filesystem and IO path layers to make REQ_OP_PROVISION work correctly because of arbitrary constraints imposed by the block layer implementations, then we've failed miserably at the design and architecture stage.

Keep in mind that every attempt made so far to address the problems with block device ENOSPC errors has failed because of the complexity of the corner cases that have arisen during design and/or implementation. It's pretty ironic that now we have a proposal that is remarkably simple, free of corner cases and has virtually no cross-layer coupling at all, the first thing that people want to do is add arbitrary implementation constraints that result in complex cross-layer corner cases that now need to be handled....
Put simply: if we restrict REQ_OP_PROVISION guarantees to just REQ_META writes (or any other specific type of write operation) then it's simply not worth pursuing at the filesystem level, because the guarantees we actually need just aren't there and the complexity of discovering and handling those corner cases just isn't worth the effort.

Cheers,
Dave.
On Fri, Jun 02 2023 at 8:52P -0400, Dave Chinner <david@fromorbit.com> wrote: > On Fri, Jun 02, 2023 at 11:44:27AM -0700, Sarthak Kukreti wrote: > > On Tue, May 30, 2023 at 8:28 AM Mike Snitzer <snitzer@kernel.org> wrote: > > > > > > On Tue, May 30 2023 at 10:55P -0400, > > > Joe Thornber <thornber@redhat.com> wrote: > > > > > > > On Tue, May 30, 2023 at 3:02 PM Mike Snitzer <snitzer@kernel.org> wrote: > > > > > > > > > > > > > > Also Joe, for you proposed dm-thinp design where you distinquish > > > > > between "provision" and "reserve": Would it make sense for REQ_META > > > > > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an > > > > > LBA-specific hard request? Whereas REQ_PROVISION on its own provides > > > > > more freedom to just reserve the length of blocks? (e.g. for XFS > > > > > delalloc where LBA range is unknown, but dm-thinp can be asked to > > > > > reserve space to accomodate it). > > > > > > > > > > > > > My proposal only involves 'reserve'. Provisioning will be done as part of > > > > the usual io path. > > > > > > OK, I think we'd do well to pin down the top-level block interfaces in > > > question. Because this patchset's block interface patch (2/5) header > > > says: > > > > > > "This patch also adds the capability to call fallocate() in mode 0 > > > on block devices, which will send REQ_OP_PROVISION to the block > > > device for the specified range," > > > > > > So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A > > > user of XFS could then use fallocate() for user data -- which would > > > cause thinp's reserve to _not_ be used for critical metadata. > > Mike, I think you might have misunderstood what I have been proposing. > Possibly unintentionally, I didn't call it REQ_OP_PROVISION but > that's what I intended - the operation does not contain data at all. 
> It's an operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROS - it
> contains a range of sectors that need to be provisioned (or
> discarded), and nothing else.

No, I understood that.

> The write IOs themselves are not tagged with anything special at all.

I know, but I've been looking at how to also handle the delalloc use case (and yes I know you feel it doesn't need handling; the issue is XFS does deal nicely with ensuring it has space when it tracks its allocations on "thick" storage -- so adding coordination between XFS and dm-thin layers provides comparable safety... that safety is an expected norm).

But rather than discuss in terms of data vs metadata, the distinction is:

1) LBA range reservation (normal case, your proposal)
2) non-LBA reservation (absolute value, LBA range is known at a later stage)

But I'm clearly going off script for dwelling on wanting to handle both.

My looking at (ab)using REQ_META being set (use 1) vs not (use 2) was a crude simplification for branching between the 2 approaches.

And I understand I made you nervous by expanding the scope to a much more muddled/shitty interface. ;)

We all just need to focus on your proposal and Joe's dm-thin reservation design...

[Sarthak: FYI, this implies that it doesn't really make sense to add dm-thinp support before Joe's design is implemented. Otherwise we'll have 2 different responses to REQ_OP_PROVISION. The one that is captured in your patchset isn't adequate to properly handle ensuring upper layers (like XFS) can depend on the space being available across snapshot boundaries.]

> i.e. The proposal I made does not use REQ_PROVISION anywhere in the
> metadata/data IO path; provisioned regions are created by separate
> operations and must be tracked by the underlying block device, then
> treat any write IO to those regions as "must not fail w/ ENOSPC"
> IOs.
>
> There seems to be a lot of fear about user data requiring
> provisioning.
This is unfounded - provisioning is only needed for > explicitly provisioned space via fallocate(), not every byte of > user data written to the filesystem (the model Brian is proposing). As I mentioned above, I was just trying to get XFS-on-thinp to maintain parity with how XFS's delalloc accounting works on "thick" storage. But happy to put that to one side. Maintain focus like I mentioned above. I'm happy we have momentum and agreement on this design now. Rather than be content with that, I was mistakenly looking at other aspects and in doing so introduced "noise" before we've implemented what we all completely agree on: your and joe's designs. > Excessive use of fallocate() is self correcting - if users and/or > their applications provision too much, they are going to get ENOSPC > or have to pay more to expand the backing pool reserves they need. > But that's not a problem the block device should be trying to solve; > that's a problem for the sysadmin and/or bean counters to address. > > > > > > > The only way to distinquish the caller (between on-behalf of user data > > > vs XFS metadata) would be REQ_META? > > > > > > So should dm-thinp have a REQ_META-based distinction? Or just treat > > > all REQ_OP_PROVISION the same? > > > > > I'm in favor of a REQ_META-based distinction. > > Why? What *requirement* is driving the need for this distinction? Think I answered that above, XFS delalloc accounting parity on thinp. > As the person who proposed this new REQ_OP_PROVISION architecture, > I'm dead set against it. Allowing the block device provide a set of > poorly defined "conditional guarantees" policies instead of a > mechanism with a single ironclad guarantee defeats the entire > purpose of the proposal. > > We have a requirement from the *kernel ABI* that *user data writes* > must not fail with ENOSPC after an fallocate() operation. That's > one of the high level policies we need to implement. 
The filesystem > is already capable of guaranteeing it won't give the user ENOSPC > after fallocate, we now need a guarantee from the filesystem's > backing store that it won't give ENOSPC, too. Yes, I was trying to navigate Joe's reluctance to even support fallocate() for arbitrary user data. That's where the REQ_META vs data distinction crept in for me. But as you say: using fallocate() excessively is self-correcting. > The _other thing_ we need to implement is a method of guaranteeing > the filesystem won't shut down when the backing device goes ENOSPC > unexpected during metadata writeback. So we also need the backing > device to guarantee the regions we write metadata to won't give > ENOSPC. Yeap. > That's the whole point of REQ_OP_PROVISION: from the layers above > the block device, there is -zero- difference between the guarantee > we need for user data writes to avoid ENOSPC and for metadata writes > to avoid ENOSPC. They are one and the same. I know. The difference comes from delalloc initially needing an absolute value of reserve rather than a specific LBA range. > Hence if the block device is going to say "I support provisioning" > but then give different conditional guarantees according to the > *type of data* in the IO request, then it does not provide the > functionality the higher layers actually require from it. I was going for relaxing the "dynamic" approach (Brian's) to be best-effort -- and really only for XFS delalloc usecase. Every other usecase would respect your and Joe's vision. > Indeed, what type of data the IO contains is *context dependent*. > For example, sometimes we write metadata with user data IO and but > we still need provisioning guarantees as if it was issued as > metadata IO. This is the case for mkfs initialising the file system > by writing directly to the block device. I'm aware. 
> IOWs, filesystem metadata IO issued from kernel context would be > considered metadata IO, but from userspace it would be considered > normal user data IO and hence treated differently. But the reality > is that they both need the same provisioning guarantees to be > provided by the block device. What I was looking at is making the fallocate interface able to express: I need dave's requirements (bog-standard actually) vs I need non-LBA best effort. > So how do userspace tools deal with this if the block device > requires REQ_META on user data IOs to do the right thing here? And > if we provide a mechanism to allow this, how do we prevent userspace > for always using it on writes to fallocate() provisioned space? > > It's just not practical for the block device to add arbitrary > constraints based on the type of IO because we then have to add > mechanisms to userspace APIs to allow them to control the IO context > so the block device will do the right thing. Especially considering > we really only need one type of guarantee regardless of where the IO > originates from or what type of data the IO contains.... If anything my disposition on the conditional to require a REQ_META (or some fallocate generated REQ_UNSHARE ditto to reflect the same) to perform your approach to REQ_OP_PROVISION and honor fallocate() requirements is a big problem. Would be much better to have a flag to express "this reservation does not have an LBA range _yet_, nevertheless try to be mindful of this expected near-term block allocation". But I'll stop inlining repetitive (similar but different) answers to your concern now ;) > > Does that imply that > > REQ_META also needs to be passed through the block/filesystem stack > > (eg. REQ_OP_PROVION + REQ_META on a loop device translates to a > > fallocate(<insert meta flag name>) to the underlying file)? > > This is exactly the same case as above: the loopback device does > user data IO to the backing file. 
Hence we have another situation > where metadata IO is issued to fallocate()d user data ranges as user > data ranges and so would be given a lesser guarantee that would lead > to upper filesystem failure. BOth upper and lower filesystem data > and metadata need to be provided the same ENOSPC guarantees by their > backing stores.... > > The whole point of the REQ_OP_PROVISION proposal I made is that it > doesn't require any special handling in corner cases like this. > There are no cross-layer interactions needed to make everything work > correctly because the provisioning guarantee is not -data type > dependent*. The entire user IO path code remains untouched and > blissfully unaware of provisioned regions. > > And, realistically, if we have to start handling complex corner > cases in the filesystem and IO path layers to make REQ_OP_PROVISION > work correctly because of arbitary constraints imposed by the block > layer implementations, then we've failed miserably at the design and > architecture stage. > > Keep in mind that every attempt made so far to address the problems > with block device ENOSPC errors has failed because of the complexity > of the corner cases that have arisen during design and/or > implementation. It's pretty ironic that now we have a proposal that > is remarkably simple, free of corner cases and has virtually no > cross-layer coupling at all, the first thing that people want to do > is add arbitrary implementation constraints that result in complex > cross-layer corner cases that now need to be handled.... > > Put simply: if we restrict REQ_OP_PROVISION guarantees to just > REQ_META writes (or any other specific type of write operation) then > it's simply not worth persuing at the filesystem level because the > guarantees we actually need just aren't there and the complexity of > discovering and handling those corner cases just isn't worth the > effort. 
Here is where I get to say: I think you misunderstood me (but it was my fault for not being absolutely clear: I'm very much on the same page as you and Joe; and your visions need to just be implemented ASAP). I was taking your designs as a given, but looking further at: how do we also handle the non-LBA (delalloc) usecase _before_ we include REQ_OP_PROVISION in kernel. But I'm happy to let the delalloc case go (we can revisit addressing it if/when needed). Mike
On Sat, Jun 3, 2023 at 8:57 AM Mike Snitzer <snitzer@kernel.org> wrote: > > On Fri, Jun 02 2023 at 8:52P -0400, > Dave Chinner <david@fromorbit.com> wrote: > > > On Fri, Jun 02, 2023 at 11:44:27AM -0700, Sarthak Kukreti wrote: > > > On Tue, May 30, 2023 at 8:28 AM Mike Snitzer <snitzer@kernel.org> wrote: > > > > > > > > On Tue, May 30 2023 at 10:55P -0400, > > > > Joe Thornber <thornber@redhat.com> wrote: > > > > > > > > > On Tue, May 30, 2023 at 3:02 PM Mike Snitzer <snitzer@kernel.org> wrote: > > > > > > > > > > > > > > > > > Also Joe, for you proposed dm-thinp design where you distinquish > > > > > > between "provision" and "reserve": Would it make sense for REQ_META > > > > > > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an > > > > > > LBA-specific hard request? Whereas REQ_PROVISION on its own provides > > > > > > more freedom to just reserve the length of blocks? (e.g. for XFS > > > > > > delalloc where LBA range is unknown, but dm-thinp can be asked to > > > > > > reserve space to accomodate it). > > > > > > > > > > > > > > > > My proposal only involves 'reserve'. Provisioning will be done as part of > > > > > the usual io path. > > > > > > > > OK, I think we'd do well to pin down the top-level block interfaces in > > > > question. Because this patchset's block interface patch (2/5) header > > > > says: > > > > > > > > "This patch also adds the capability to call fallocate() in mode 0 > > > > on block devices, which will send REQ_OP_PROVISION to the block > > > > device for the specified range," > > > > > > > > So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A > > > > user of XFS could then use fallocate() for user data -- which would > > > > cause thinp's reserve to _not_ be used for critical metadata. > > > > Mike, I think you might have misunderstood what I have been proposing. > > Possibly unintentionally, I didn't call it REQ_OP_PROVISION but > > that's what I intended - the operation does not contain data at all. 
> > It's an operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROS - it > > contains a range of sectors that need to be provisioned (or > > discarded), and nothing else. > > No, I understood that. > > > The write IOs themselves are not tagged with anything special at all. > > I know, but I've been looking at how to also handle the delalloc > usecase (and yes I know you feel it doesn't need handling, the issue > is XFS does deal nicely with ensuring it has space when it tracks its > allocations on "thick" storage -- so adding coordination between XFS > and dm-thin layers provides comparable safety.. that safety is an > expected norm). > > But rather than discuss in terms of data vs metadata, the distinction > is: > 1) LBA range reservation (normal case, your proposal) > 2) non-LBA reservation (absolute value, LBA range is known at later stage) > > But I'm clearly going off script for dwelling on wanting to handle > both. > > My looking at (ab)using REQ_META being set (use 1) vs not (use 2) was > a crude simplification for branching between the 2 approaches. > > And I understand I made you nervous by expanding the scope to a much > more muddled/shitty interface. ;) > > We all just need to focus on your proposal and Joe's dm-thin > reservation design... > > [Sarthak: FYI, this implies that it doesn't really make sense to add > dm-thinp support before Joe's design is implemented. Otherwise we'll > have 2 different responses to REQ_OP_PROVISION. The one that is > captured in your patchset isn't adequate to properly handle ensuring > upper layer (like XFS) can depend on the space being available across > snapshot boundaries.] > Ack. Would it be premature for the rest of the series to go through (REQ_OP_PROVISION + support for loop and non-dm-thinp device-mapper targets)? I'd like to start using this as a reference to suggest additions to the virtio-spec for virtio-blk support and start looking at what an ext4 implementation would look like. > > i.e. 
The proposal I made does not use REQ_PROVISION anywhere in the > > metadata/data IO path; provisioned regions are created by separate > > operations and must be tracked by the underlying block device, then > > treat any write IO to those regions as "must not fail w/ ENOSPC" > > IOs. > > > > There seems to be a lot of fear about user data requiring > > provisioning. This is unfounded - provisioning is only needed for > > explicitly provisioned space via fallocate(), not every byte of > > user data written to the filesystem (the model Brian is proposing). > > As I mentioned above, I was just trying to get XFS-on-thinp to > maintain parity with how XFS's delalloc accounting works on "thick" > storage. > > But happy to put that to one side. Maintain focus like I mentioned > above. I'm happy we have momentum and agreement on this design now. > Rather than be content with that, I was mistakenly looking at other > aspects and in doing so introduced "noise" before we've implemented > what we all completely agree on: your and joe's designs. > > > Excessive use of fallocate() is self correcting - if users and/or > > their applications provision too much, they are going to get ENOSPC > > or have to pay more to expand the backing pool reserves they need. > > But that's not a problem the block device should be trying to solve; > > that's a problem for the sysadmin and/or bean counters to address. > > > > > > > > > > The only way to distinquish the caller (between on-behalf of user data > > > > vs XFS metadata) would be REQ_META? > > > > > > > > So should dm-thinp have a REQ_META-based distinction? Or just treat > > > > all REQ_OP_PROVISION the same? > > > > > > > I'm in favor of a REQ_META-based distinction. > > > > Why? What *requirement* is driving the need for this distinction? > > Think I answered that above, XFS delalloc accounting parity on thinp. 
> I actually had a few different use-cases in mind (apart from the user data provisioning 'fear' that you pointed out): in essence, there are cases where userspace would benefit from having more control over how much space a snapshot takes: 1) In the original RFC patchset [1], I alluded to this being a mechanism for pre-allocating space for preserving space for thin logical volumes. The use-case I'd like to explore is delta updatable read-only filesystems similar to systemd system extensions [2]: In essence: a) Preserve space for a 'base' thin logical volume that will contain a read-only filesystem on over-the-air installation: for filesystems like squashfs and erofs, pretty much the entire image is a compressed file that I'd like to reserve space for before installation. b) Before update, create a thin snapshot and preserve enough space to ensure that a delta update will succeed (eg. block level diff of the base image). Then, the update is guaranteed to have disk space to succeed (similar to the A-B update guarantees on ChromeOS). On success, we merge the snapshot and reserve an update snapshot for the next possible update. On failure, we drop the snapshot. 2) The other idea I wanted to explore was rollback protection for stateful filesystem features: in essence, if an update from kernel 4.x to 5.y failed very quickly (due to unrelated reasons) and we enabled some stateful filesystem features that are only supported on 5.y, we'd be able to rollback to 4.x if we used short-lived snapshots (in the ChromiumOS world, the lifetime of these snapshots would be < 10s per boot). On reflection, the metadata vs user data distinction was a means to an end for me: I'd like to retain the capability to create thin short-lived snapshots from userspace _regardless_ of the provisioned ranges of a thin volume and the flexibility to manipulate the space requirements on these snapshot volumes from userspace. This might appear as "bean-counting" but if I have eg. 
a 4GB read-only compressed filesystem and I know, a priori, an update will take at most 400M, I shouldn't need to (momentarily) reserve another 4GB or add more disks to complete the update. [1] https://lkml.org/lkml/2022/9/15/785 [2] https://www.freedesktop.org/software/systemd/man/systemd-sysext.html > > As the person who proposed this new REQ_OP_PROVISION architecture, > > I'm dead set against it. Allowing the block device provide a set of > > poorly defined "conditional guarantees" policies instead of a > > mechanism with a single ironclad guarantee defeats the entire > > purpose of the proposal. > > > > We have a requirement from the *kernel ABI* that *user data writes* > > must not fail with ENOSPC after an fallocate() operation. That's > > one of the high level policies we need to implement. The filesystem > > is already capable of guaranteeing it won't give the user ENOSPC > > after fallocate, we now need a guarantee from the filesystem's > > backing store that it won't give ENOSPC, too. > > Yes, I was trying to navigate Joe's reluctance to even support > fallocate() for arbitrary user data. That's where the REQ_META vs > data distinction crept in for me. But as you say: using fallocate() > excessively is self-correcting. > > > The _other thing_ we need to implement is a method of guaranteeing > > the filesystem won't shut down when the backing device goes ENOSPC > > unexpected during metadata writeback. So we also need the backing > > device to guarantee the regions we write metadata to won't give > > ENOSPC. > > Yeap. > > > That's the whole point of REQ_OP_PROVISION: from the layers above > > the block device, there is -zero- difference between the guarantee > > we need for user data writes to avoid ENOSPC and for metadata writes > > to avoid ENOSPC. They are one and the same. > > I know. The difference comes from delalloc initially needing an > absolute value of reserve rather than a specific LBA range. 
> > > Hence if the block device is going to say "I support provisioning" > > but then give different conditional guarantees according to the > > *type of data* in the IO request, then it does not provide the > > functionality the higher layers actually require from it. > > I was going for relaxing the "dynamic" approach (Brian's) to be > best-effort -- and really only for XFS delalloc usecase. Every other > usecase would respect your and Joe's vision. > > > Indeed, what type of data the IO contains is *context dependent*. > > For example, sometimes we write metadata with user data IO and but > > we still need provisioning guarantees as if it was issued as > > metadata IO. This is the case for mkfs initialising the file system > > by writing directly to the block device. > > I'm aware. > > > IOWs, filesystem metadata IO issued from kernel context would be > > considered metadata IO, but from userspace it would be considered > > normal user data IO and hence treated differently. But the reality > > is that they both need the same provisioning guarantees to be > > provided by the block device. > > What I was looking at is making the fallocate interface able to > express: I need dave's requirements (bog-standard actually) vs I need > non-LBA best effort. > > > So how do userspace tools deal with this if the block device > > requires REQ_META on user data IOs to do the right thing here? And > > if we provide a mechanism to allow this, how do we prevent userspace > > for always using it on writes to fallocate() provisioned space? > > > > It's just not practical for the block device to add arbitrary > > constraints based on the type of IO because we then have to add > > mechanisms to userspace APIs to allow them to control the IO context > > so the block device will do the right thing. Especially considering > > we really only need one type of guarantee regardless of where the IO > > originates from or what type of data the IO contains.... 
> > If anything my disposition on the conditional to require a REQ_META > (or some fallocate generated REQ_UNSHARE ditto to reflect the same) to > perform your approach to REQ_OP_PROVISION and honor fallocate() > requirements is a big problem. Would be much better to have a flag to > express "this reservation does not have an LBA range _yet_, > nevertheless try to be mindful of this expected near-term block > allocation". > > But I'll stop inlining repetitive (similar but different) answers to > your concern now ;) > > > > Does that imply that > > > REQ_META also needs to be passed through the block/filesystem stack > > > (eg. REQ_OP_PROVISION + REQ_META on a loop device translates to a > > > fallocate(<insert meta flag name>) to the underlying file)? > > > > This is exactly the same case as above: the loopback device does > > user data IO to the backing file. Hence we have another situation > > where metadata IO is issued to fallocate()d user data ranges as user > > data ranges and so would be given a lesser guarantee that would lead > > to upper filesystem failure. Both upper and lower filesystem data > > and metadata need to be provided the same ENOSPC guarantees by their > > backing stores.... > > > > The whole point of the REQ_OP_PROVISION proposal I made is that it > > doesn't require any special handling in corner cases like this. > > There are no cross-layer interactions needed to make everything work > > correctly because the provisioning guarantee is not -data type > > dependent-. The entire user IO path code remains untouched and > > blissfully unaware of provisioned regions. > > > > And, realistically, if we have to start handling complex corner > > cases in the filesystem and IO path layers to make REQ_OP_PROVISION > > work correctly because of arbitrary constraints imposed by the block > > layer implementations, then we've failed miserably at the design and > > architecture stage. 
> > > > Keep in mind that every attempt made so far to address the problems > > with block device ENOSPC errors has failed because of the complexity > > of the corner cases that have arisen during design and/or > > implementation. It's pretty ironic that now we have a proposal that > > is remarkably simple, free of corner cases and has virtually no > > cross-layer coupling at all, the first thing that people want to do > > is add arbitrary implementation constraints that result in complex > > cross-layer corner cases that now need to be handled.... > > > > Put simply: if we restrict REQ_OP_PROVISION guarantees to just > > REQ_META writes (or any other specific type of write operation) then > > it's simply not worth pursuing at the filesystem level because the > > guarantees we actually need just aren't there and the complexity of > > discovering and handling those corner cases just isn't worth the > > effort. Fair points, I certainly don't want to derail this conversation; I'd be happy to see this work merged sooner rather than later. For posterity, I'll distill what I said above into the following: I'd like a capability for userspace to create thin snapshots that ignore the thin volume's provisioned areas. IOW, an opt-in flag which makes snapshots fall back to what they do today to provide flexibility to userspace to decide the space requirements for the above-mentioned scenarios, and at the same time, not adding separate corner case handling for filesystems. But to reiterate, my intent isn't to pile this onto the work you, Mike and Joe have planned; just some insight into why I'm in favor of ideas that reduce the snapshot size. Best Sarthak > > Here is where I get to say: I think you misunderstood me (but it was > my fault for not being absolutely clear: I'm very much on the same > page as you and Joe; and your visions need to just be implemented > ASAP). 
> > I was taking your designs as a given, but looking further at: how do > we also handle the non-LBA (delalloc) usecase _before_ we include > REQ_OP_PROVISION in kernel. > > But I'm happy to let the delalloc case go (we can revisit addressing > it if/when needed). > > Mike
On Sat, Jun 03, 2023 at 11:57:48AM -0400, Mike Snitzer wrote: > On Fri, Jun 02 2023 at 8:52P -0400, > Dave Chinner <david@fromorbit.com> wrote: > > > On Fri, Jun 02, 2023 at 11:44:27AM -0700, Sarthak Kukreti wrote: > > > On Tue, May 30, 2023 at 8:28 AM Mike Snitzer <snitzer@kernel.org> wrote: > > > > > > > > On Tue, May 30 2023 at 10:55P -0400, > > > > Joe Thornber <thornber@redhat.com> wrote: > > > > > > > > > On Tue, May 30, 2023 at 3:02 PM Mike Snitzer <snitzer@kernel.org> wrote: > > > > > > > > > > > > > > > > > Also Joe, for you proposed dm-thinp design where you distinquish > > > > > > between "provision" and "reserve": Would it make sense for REQ_META > > > > > > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an > > > > > > LBA-specific hard request? Whereas REQ_PROVISION on its own provides > > > > > > more freedom to just reserve the length of blocks? (e.g. for XFS > > > > > > delalloc where LBA range is unknown, but dm-thinp can be asked to > > > > > > reserve space to accomodate it). > > > > > > > > > > > > > > > > My proposal only involves 'reserve'. Provisioning will be done as part of > > > > > the usual io path. > > > > > > > > OK, I think we'd do well to pin down the top-level block interfaces in > > > > question. Because this patchset's block interface patch (2/5) header > > > > says: > > > > > > > > "This patch also adds the capability to call fallocate() in mode 0 > > > > on block devices, which will send REQ_OP_PROVISION to the block > > > > device for the specified range," > > > > > > > > So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A > > > > user of XFS could then use fallocate() for user data -- which would > > > > cause thinp's reserve to _not_ be used for critical metadata. > > > > Mike, I think you might have misunderstood what I have been proposing. > > Possibly unintentionally, I didn't call it REQ_OP_PROVISION but > > that's what I intended - the operation does not contain data at all. 
> > It's an operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROS - it > > contains a range of sectors that need to be provisioned (or > > discarded), and nothing else. > > No, I understood that. > > > The write IOs themselves are not tagged with anything special at all. > > I know, but I've been looking at how to also handle the delalloc > usecase (and yes I know you feel it doesn't need handling, the issue > is XFS does deal nicely with ensuring it has space when it tracks its > allocations on "thick" storage Oh, no it doesn't. It -works for most cases-, but that does not mean it provides any guarantees at all. We can still get ENOSPC for user data when delayed allocation reservations "run out". This may be news to you, but the ephemeral XFS delayed allocation space reservation is not accurate. It contains a "fudge factor" called "indirect length". This is a "wet finger in the wind" estimation of how much new metadata will need to be allocated to index the physical allocations when they are made. It assumes large data extents are allocated, which is good enough for most cases, but it is no guarantee when there are no large data extents available to allocate (e.g. near ENOSPC!). And therein lies the fundamental problem with ephemeral range reservations: at the time of reservation, we don't know how many individual physical LBA ranges the reserved data range is actually going to span. As a result, XFS delalloc reservations are a "close-but-not-quite" reservation backed by a global reserve pool that can be dipped into if we run out of delalloc reservation. If the reserve pool is then fully depleted before all delalloc conversion completes, we'll still give ENOSPC. The pool is sized such that the vast majority of workloads will complete delalloc conversion successfully before the pool is depleted. 
Hence XFS gives everyone the -appearance- that it deals nicely with ENOSPC conditions, but it never provides a -guarantee- that any accepted write will always succeed without ENOSPC. IMO, using this "close-but-not-quite" reservation as the basis of space requirements for other layers to provide "won't ENOSPC" guarantees is fraught with problems. We already know that it is insufficient in important corner cases at the filesystem level, and we also know that lower layers trying to do ephemeral space reservations will have exactly the same problems providing a guarantee. And these are problems we've been unable to engineer around in the past, so the likelihood we can engineer around them now or in the future is also very unlikely. > -- so adding coordination between XFS > and dm-thin layers provides comparable safety.. that safety is an > expected norm). > > But rather than discuss in terms of data vs metadata, the distinction > is: > 1) LBA range reservation (normal case, your proposal) > 2) non-LBA reservation (absolute value, LBA range is known at later stage) > > But I'm clearly going off script for dwelling on wanting to handle > both. Right, because if we do 1) then we don't need 2). :) > My looking at (ab)using REQ_META being set (use 1) vs not (use 2) was > a crude simplification for branching between the 2 approaches. > > And I understand I made you nervous by expanding the scope to a much > more muddled/shitty interface. ;) Nervous? No, I'm simply trying to make sure that everyone is on the same page. i.e. that if we water down the guarantee that 1) relies on, then it's not actually useful to filesystems at all. > > It's just not practical for the block device to add arbitrary > > constraints based on the type of IO because we then have to add > > mechanisms to userspace APIs to allow them to control the IO context > > so the block device will do the right thing. 
Especially considering > > we really only need one type of guarantee regardless of where the IO > > originates from or what type of data the IO contains.... > > If anything my disposition on the conditional to require a REQ_META > (or some fallocate generated REQ_UNSHARE ditto to reflect the same) to > perform your approach to REQ_OP_PROVISION and honor fallocate() > requirements is a big problem. Would be much better to have a flag to > express "this reservation does not have an LBA range _yet_, > nevertheless try to be mindful of this expected near-term block > allocation". And that's where all the complexity starts ;) > > Put simply: if we restrict REQ_OP_PROVISION guarantees to just > > REQ_META writes (or any other specific type of write operation) then > > it's simply not worth pursuing at the filesystem level because the > > guarantees we actually need just aren't there and the complexity of > > discovering and handling those corner cases just isn't worth the > > effort. > > Here is where I get to say: I think you misunderstood me (but it was > my fault for not being absolutely clear: I'm very much on the same > page as you and Joe; and your visions need to just be implemented > ASAP). OK, good that we've clarified the misunderstandings on both sides quickly :) > I was taking your designs as a given, but looking further at: how do > we also handle the non-LBA (delalloc) usecase _before_ we include > REQ_OP_PROVISION in kernel. > > But I'm happy to let the delalloc case go (we can revisit addressing > it if/when needed). Again, I really don't think filesystem delalloc ranges ever need to be covered by block device provisioning guarantees because the filesystem itself provides no guarantees for unprovisioned writes. 
I suspect that if, in future, we want to manage unprovisioned space in different ways, we're better off taking this sort of approach: https://lore.kernel.org/linux-xfs/20171026083322.20428-1-david@fromorbit.com/ because using grow/shrink to manage the filesystem's unprovisioned space is far, far simpler than trying to use dynamic, cross-layer ephemeral reservations. Indeed, with the block device filesystem shutdown path Christoph recently posted, we have a model for adding in-kernel filesystem control interfaces for block devices... There's something to be said for turning everything upside down occasionally. :) Cheers, Dave.
On Mon, Jun 05, 2023 at 02:14:44PM -0700, Sarthak Kukreti wrote: > On Sat, Jun 3, 2023 at 8:57 AM Mike Snitzer <snitzer@kernel.org> wrote: > > On Fri, Jun 02 2023 at 8:52P -0400, > > Dave Chinner <david@fromorbit.com> wrote: > > > On Fri, Jun 02, 2023 at 11:44:27AM -0700, Sarthak Kukreti wrote: > > > > > The only way to distinquish the caller (between on-behalf of user data > > > > > vs XFS metadata) would be REQ_META? > > > > > > > > > > So should dm-thinp have a REQ_META-based distinction? Or just treat > > > > > all REQ_OP_PROVISION the same? > > > > > > > > > I'm in favor of a REQ_META-based distinction. > > > > > > Why? What *requirement* is driving the need for this distinction? > > > > Think I answered that above, XFS delalloc accounting parity on thinp. > > > I actually had a few different use-cases in mind (apart from the user > data provisioning 'fear' that you pointed out): in essence, there are > cases where userspace would benefit from having more control over how > much space a snapshot takes: > > 1) In the original RFC patchset [1], I alluded to this being a > mechanism for pre-allocating space for preserving space for thin > logical volumes. The use-case I'd like to explore is delta updatable > read-only filesystems similar to systemd system extensions [2]: In > essence: > a) Preserve space for a 'base' thin logical volume that will contain a > read-only filesystem on over-the-air installation: for filesystems > like squashfs and erofs, pretty much the entire image is a compressed > file that I'd like to reserve space for before installation. > b) Before update, create a thin snapshot and preserve enough space to > ensure that a delta update will succeed (eg. block level diff of the > base image). Then, the update is guaranteed to have disk space to > succeed (similar to the A-B update guarantees on ChromeOS). On > success, we merge the snapshot and reserve an update snapshot for the > next possible update. On failure, we drop the snapshot. 
Sounds very similar to the functionality blksnap is supposed to provide.... https://lore.kernel.org/linux-fsdevel/20230404140835.25166-1-sergei.shtepa@veeam.com/ > 2) The other idea I wanted to explore was rollback protection for > stateful filesystem features: in essence, if an update from kernel 4.x > to 5.y failed very quickly (due to unrelated reasons) and we enabled > some stateful filesystem features that are only supported on 5.y, we'd > be able to rollback to 4.x if we used short-lived snapshots (in the > ChromiumOS world, the lifetime of these snapshots would be < 10s per > boot). Not sure that blksnap has a "roll origin back to read-only snapshot" feature yet, but that's what you'd need for this. i.e. on success, drop the snapshot. On failure, "roll origin back to snapshot and reboot". Cheers, Dave.
On Mon, Jun 05 2023 at 5:14P -0400, Sarthak Kukreti <sarthakkukreti@chromium.org> wrote: > On Sat, Jun 3, 2023 at 8:57 AM Mike Snitzer <snitzer@kernel.org> wrote: > > > > We all just need to focus on your proposal and Joe's dm-thin > > reservation design... > > > > [Sarthak: FYI, this implies that it doesn't really make sense to add > > dm-thinp support before Joe's design is implemented. Otherwise we'll > > have 2 different responses to REQ_OP_PROVISION. The one that is > > captured in your patchset isn't adequate to properly handle ensuring > > upper layer (like XFS) can depend on the space being available across > > snapshot boundaries.] > > > Ack. Would it be premature for the rest of the series to go through > (REQ_OP_PROVISION + support for loop and non-dm-thinp device-mapper > targets)? I'd like to start using this as a reference to suggest > additions to the virtio-spec for virtio-blk support and start looking > at what an ext4 implementation would look like. Please drop the dm-thin.c and dm-snap.c changes. dm-snap.c would need more work to provide the type of guarantee XFS requires across snapshot boundaries. I'm inclined to _not_ add dm-snap.c support because it is best to just use dm-thin. And FYI even your dm-thin patch will be the starting point for the dm-thin support (we'll keep attribution to you for all the code in a separate patch). > Fair points, I certainly don't want to derail this conversation; I'd > be happy to see this work merged sooner rather than later. Once those dm target changes are dropped I think the rest of the series is fine to go upstream now. Feel free to post a v8. > For posterity, I'll distill what I said above into the following: I'd like > a capability for userspace to create thin snapshots that ignore the > thin volume's provisioned areas. 
IOW, an opt-in flag which makes > snapshots fallback to what they do today to provide flexibility to > userspace to decide the space requirements for the above mentioned > scenarios, and at the same time, not adding separate corner case > handling for filesystems. But to reiterate, my intent isn't to pile > this onto the work you, Mike and Joe have planned; just some insight > into why I'm in favor of ideas that reduce the snapshot size. I think it'd be useful to ignore a thin device's reservation for read-only snapshots. Adding the ability to create read-only thin snapshots could make sense (later activations don't necessarily need to impose read-only, doing so would require some additional work). Mike
On Tue, Jun 06 2023 at 10:01P -0400, Dave Chinner <david@fromorbit.com> wrote: > On Sat, Jun 03, 2023 at 11:57:48AM -0400, Mike Snitzer wrote: > > On Fri, Jun 02 2023 at 8:52P -0400, > > Dave Chinner <david@fromorbit.com> wrote: > > > > > Mike, I think you might have misunderstood what I have been proposing. > > > Possibly unintentionally, I didn't call it REQ_OP_PROVISION but > > > that's what I intended - the operation does not contain data at all. > > > It's an operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROS - it > > > contains a range of sectors that need to be provisioned (or > > > discarded), and nothing else. > > > > No, I understood that. > > > > > The write IOs themselves are not tagged with anything special at all. > > > > I know, but I've been looking at how to also handle the delalloc > > usecase (and yes I know you feel it doesn't need handling, the issue > > is XFS does deal nicely with ensuring it has space when it tracks its > > allocations on "thick" storage > > Oh, no it doesn't. It -works for most cases-, but that does not mean > it provides any guarantees at all. We can still get ENOSPC for user > data when delayed allocation reservations "run out". > > This may be news to you, but the ephemeral XFS delayed allocation > space reservation is not accurate. It contains a "fudge factor" > called "indirect length". This is a "wet finger in the wind" > estimation of how much new metadata will need to be allocated to > index the physical allocations when they are made. It assumes large > data extents are allocated, which is good enough for most cases, but > it is no guarantee when there are no large data extents available to > allocate (e.g. near ENOSPC!). > > And therein lies the fundamental problem with ephemeral range > reservations: at the time of reservation, we don't know how many > individual physical LBA ranges the reserved data range is actually > going to span. 
> > As a result, XFS delalloc reservations are a "close-but-not-quite" > reservation backed by a global reserve pool that can be dipped into > if we run out of delalloc reservation. If the reserve pool is then > fully depleted before all delalloc conversion completes, we'll still > give ENOSPC. The pool is sized such that the vast majority of > workloads will complete delalloc conversion successfully before the > pool is depleted. > > Hence XFS gives everyone the -appearance- that it deals nicely with > ENOSPC conditions, but it never provides a -guarantee- that any > accepted write will always succeed without ENOSPC. > > IMO, using this "close-but-not-quite" reservation as the basis of > space requirements for other layers to provide "won't ENOSPC" > guarantees is fraught with problems. We already know that it is > insufficient in important corner cases at the filesystem level, and > we also know that lower layers trying to do ephemeral space > reservations will have exactly the same problems providing a > guarantee. And these are problems we've been unable to engineer > around in the past, so the likelihood we can engineer around them > now or in the future is also very unlikely. Thanks for clarifying. Wasn't aware of XFS delalloc's "wet finger in the air" ;) So do you think it's reasonable to require applications to fallocate their data files? I'm not sure users are aware they need to take that extra step. > > -- so adding coordination between XFS > > and dm-thin layers provides comparable safety.. that safety is an > > expected norm). > > > > But rather than discuss in terms of data vs metadata, the distinction > > is: > > 1) LBA range reservation (normal case, your proposal) > > 2) non-LBA reservation (absolute value, LBA range is known at later stage) > > > > But I'm clearly going off script for dwelling on wanting to handle > > both. > > Right, because if we do 1) then we don't need 2). :) Sure. 
> > My looking at (ab)using REQ_META being set (use 1) vs not (use 2) was > > a crude simplification for branching between the 2 approaches. > > > > And I understand I made you nervous by expanding the scope to a much > > more muddled/shitty interface. ;) > > Nervous? No, I'm simply trying to make sure that everyone is on the > same page. i.e. that if we water down the guarantee that 1) relies > on, then it's not actually useful to filesystems at all. Yeah, makes sense. > > > Put simply: if we restrict REQ_OP_PROVISION guarantees to just > > > REQ_META writes (or any other specific type of write operation) then > > > it's simply not worth pursuing at the filesystem level because the > > > guarantees we actually need just aren't there and the complexity of > > > discovering and handling those corner cases just isn't worth the > > > effort. > > > > Here is where I get to say: I think you misunderstood me (but it was > > my fault for not being absolutely clear: I'm very much on the same > > page as you and Joe; and your visions need to just be implemented > > ASAP). > > OK, good that we've clarified the misunderstandings on both sides > quickly :) Do you think you're OK to scope out, and/or implement, the XFS changes if you use v7 of this patchset as the starting point? (v8 should just be v7 minus the dm-thin.c and dm-snap.c changes). The thinp support in v7 will work enough to allow XFS to issue REQ_OP_PROVISION and/or fallocate (via mkfs.xfs) to dm-thin devices. And Joe and I can make independent progress on the dm-thin.c changes needed to ensure the REQ_OP_PROVISION guarantee you need. Thanks, Mike
Dave,

> Possibly unintentionally, I didn't call it REQ_OP_PROVISION but that's
> what I intended - the operation does not contain data at all. It's an
> operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROES - it contains a
> range of sectors that need to be provisioned (or discarded), and
> nothing else.

Yep. That's also how SCSI defines it. The act of provisioning a block
range is done through an UNMAP command using a special flag. All it
does is pin down those LBAs so future writes to them won't result in
ENOSPC.
On Wed, Jun 07, 2023 at 10:03:40PM -0400, Martin K. Petersen wrote:
>
> Dave,
>
> > Possibly unintentionally, I didn't call it REQ_OP_PROVISION but that's
> > what I intended - the operation does not contain data at all. It's an
> > operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROES - it contains a
> > range of sectors that need to be provisioned (or discarded), and
> > nothing else.
>
> Yep. That's also how SCSI defines it. The act of provisioning a block
> range is done through an UNMAP command using a special flag. All it does
> is pin down those LBAs so future writes to them won't result in ENOSPC.

*nod* That I knew, and it's one of the reasons I'd like the
filesystem <-> block layer provisioning model to head in this
direction. i.e. we don't have to do anything special to enable routing
of provisioning requests to hardware and/or remote block storage
devices (e.g. ceph-rbd, nbd, etc). Hence "external" devices can
provide the same guarantees as native software-only block device
implementations like dm-thinp, and everything gets just that little
bit better behaved...

Cheers,
Dave.
On Wed, Jun 07, 2023 at 07:50:25PM -0400, Mike Snitzer wrote:
> Do you think you're OK to scope out, and/or implement, the XFS changes
> if you use v7 of this patchset as the starting point? (v8 should just
> be v7 minus the dm-thin.c and dm-snap.c changes). The thinp
> support in v7 will work enough to allow XFS to issue REQ_OP_PROVISION
> and/or fallocate (via mkfs.xfs) to dm-thin devices.

Yup, XFS only needs blkdev_issue_provision() and
bdev_max_provision_sectors() to be present; the rest is filesystem
code. The initial XFS provisioning detection and fallocate() support
is just under 50 lines of new code...

Cheers,
Dave.
On Wed, Jun 07 2023 at 7:27P -0400,
Mike Snitzer <snitzer@kernel.org> wrote:

> On Mon, Jun 05 2023 at 5:14P -0400,
> Sarthak Kukreti <sarthakkukreti@chromium.org> wrote:
>
> > On Sat, Jun 3, 2023 at 8:57 AM Mike Snitzer <snitzer@kernel.org> wrote:
> > >
> > > We all just need to focus on your proposal and Joe's dm-thin
> > > reservation design...
> > >
> > > [Sarthak: FYI, this implies that it doesn't really make sense to add
> > > dm-thinp support before Joe's design is implemented. Otherwise we'll
> > > have 2 different responses to REQ_OP_PROVISION. The one that is
> > > captured in your patchset isn't adequate to properly handle ensuring
> > > upper layer (like XFS) can depend on the space being available across
> > > snapshot boundaries.]
> > >
> > Ack. Would it be premature for the rest of the series to go through
> > (REQ_OP_PROVISION + support for loop and non-dm-thinp device-mapper
> > targets)? I'd like to start using this as a reference to suggest
> > additions to the virtio-spec for virtio-blk support and start looking
> > at what an ext4 implementation would look like.
>
> Please drop the dm-thin.c and dm-snap.c changes. dm-snap.c would need
> more work to provide the type of guarantee XFS requires across
> snapshot boundaries. I'm inclined to _not_ add dm-snap.c support
> because it is best to just use dm-thin.
>
> And FYI even your dm-thin patch will be the starting point for the
> dm-thin support (we'll keep attribution to you for all the code in a
> separate patch).
>
> > Fair points, I certainly don't want to derail this conversation; I'd
> > be happy to see this work merged sooner rather than later.
>
> Once those dm target changes are dropped I think the rest of the
> series is fine to go upstream now. Feel free to post a v8.
FYI, I've made my latest code available in this
'dm-6.5-provision-support' branch (based on 'dm-6.5'):
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-6.5-provision-support

It's what v8 should be plus the 2 dm-thin patches (that I don't think
should go upstream yet, but are theoretically useful for Dave and
Joe).

The "dm thin: complete interface for REQ_OP_PROVISION support" commit
establishes all the dm-thin interface I think is needed. The FIXME in
process_provision_bio() (and the patch header) cautions against upper
layers like XFS using this dm-thinp support quite yet.

Otherwise we'll have the issue where dm-thinp's REQ_OP_PROVISION
support initially doesn't provide the guarantee that XFS needs across
snapshots (which is: snapshots inherit all previous REQ_OP_PROVISION).

Mike
On Fri, Jun 09, 2023 at 04:31:41PM -0400, Mike Snitzer wrote:
> On Wed, Jun 07 2023 at 7:27P -0400,
> Mike Snitzer <snitzer@kernel.org> wrote:
>
> > On Mon, Jun 05 2023 at 5:14P -0400,
> > Sarthak Kukreti <sarthakkukreti@chromium.org> wrote:
> >
> > > On Sat, Jun 3, 2023 at 8:57 AM Mike Snitzer <snitzer@kernel.org> wrote:
> > > >
> > > > We all just need to focus on your proposal and Joe's dm-thin
> > > > reservation design...
> > > >
> > > > [Sarthak: FYI, this implies that it doesn't really make sense to add
> > > > dm-thinp support before Joe's design is implemented. Otherwise we'll
> > > > have 2 different responses to REQ_OP_PROVISION. The one that is
> > > > captured in your patchset isn't adequate to properly handle ensuring
> > > > upper layer (like XFS) can depend on the space being available across
> > > > snapshot boundaries.]
> > > >
> > > Ack. Would it be premature for the rest of the series to go through
> > > (REQ_OP_PROVISION + support for loop and non-dm-thinp device-mapper
> > > targets)? I'd like to start using this as a reference to suggest
> > > additions to the virtio-spec for virtio-blk support and start looking
> > > at what an ext4 implementation would look like.
> >
> > Please drop the dm-thin.c and dm-snap.c changes. dm-snap.c would need
> > more work to provide the type of guarantee XFS requires across
> > snapshot boundaries. I'm inclined to _not_ add dm-snap.c support
> > because it is best to just use dm-thin.
> >
> > And FYI even your dm-thin patch will be the starting point for the
> > dm-thin support (we'll keep attribution to you for all the code in a
> > separate patch).
> >
> > > Fair points, I certainly don't want to derail this conversation; I'd
> > > be happy to see this work merged sooner rather than later.
> >
> > Once those dm target changes are dropped I think the rest of the
> > series is fine to go upstream now. Feel free to post a v8.
>
> FYI, I've made my latest code available in this
> 'dm-6.5-provision-support' branch (based on 'dm-6.5'):
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-6.5-provision-support
>
> It's what v8 should be plus the 2 dm-thin patches (that I don't think
> should go upstream yet, but are theoretically useful for Dave and
> Joe).
>
> The "dm thin: complete interface for REQ_OP_PROVISION support" commit
> establishes all the dm-thin interface I think is needed. The FIXME in
> process_provision_bio() (and the patch header) cautions against upper
> layers like XFS using this dm-thinp support quite yet.
>
> Otherwise we'll have the issue where dm-thinp's REQ_OP_PROVISION
> support initially doesn't provide the guarantee that XFS needs across
> snapshots (which is: snapshots inherit all previous REQ_OP_PROVISION).

Just tag it with EXPERIMENTAL on reception of the first
REQ_OP_PROVISION for the device (i.e. dump a log warning), like I'll
end up doing with XFS when it detects provisioning support at mount
time.

We do this all the time to allow merging new features before they are
fully production ready - EXPERIMENTAL means you can expect it to
mostly work, except when it doesn't, and you know that when it breaks
you get to report the bug, help triage it and as a bonus you get to
keep the broken bits!

$ git grep EXPERIMENTAL fs/xfs
fs/xfs/Kconfig: This feature is considered EXPERIMENTAL. Use with caution!
fs/xfs/Kconfig: This feature is considered EXPERIMENTAL. Use with caution!
fs/xfs/scrub/scrub.c: "EXPERIMENTAL online scrub feature in use. Use at your own risk!");
fs/xfs/xfs_fsops.c: "EXPERIMENTAL online shrink feature in use. Use at your own risk!");
fs/xfs/xfs_super.c: xfs_warn(mp, "DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
fs/xfs/xfs_super.c: "EXPERIMENTAL Large extent counts feature in use. Use at your own risk!");
fs/xfs/xfs_xattr.c: "EXPERIMENTAL logged extended attributes feature in use.
Use at your own risk!");
$

IOWs, I'll be adding a:

"EXPERIMENTAL block device provisioning in use. Use at your own risk!"

warning to XFS, and it won't get removed until both the XFS and
dm-thinp support are solid and production ready....

Cheers,
Dave.
On Fri, Jun 9, 2023 at 1:31 PM Mike Snitzer <snitzer@redhat.com> wrote:
>
> On Wed, Jun 07 2023 at 7:27P -0400,
> Mike Snitzer <snitzer@kernel.org> wrote:
>
> > On Mon, Jun 05 2023 at 5:14P -0400,
> > Sarthak Kukreti <sarthakkukreti@chromium.org> wrote:
> >
> > > On Sat, Jun 3, 2023 at 8:57 AM Mike Snitzer <snitzer@kernel.org> wrote:
> > > >
> > > > We all just need to focus on your proposal and Joe's dm-thin
> > > > reservation design...
> > > >
> > > > [Sarthak: FYI, this implies that it doesn't really make sense to add
> > > > dm-thinp support before Joe's design is implemented. Otherwise we'll
> > > > have 2 different responses to REQ_OP_PROVISION. The one that is
> > > > captured in your patchset isn't adequate to properly handle ensuring
> > > > upper layer (like XFS) can depend on the space being available across
> > > > snapshot boundaries.]
> > > >
> > > Ack. Would it be premature for the rest of the series to go through
> > > (REQ_OP_PROVISION + support for loop and non-dm-thinp device-mapper
> > > targets)? I'd like to start using this as a reference to suggest
> > > additions to the virtio-spec for virtio-blk support and start looking
> > > at what an ext4 implementation would look like.
> >
> > Please drop the dm-thin.c and dm-snap.c changes. dm-snap.c would need
> > more work to provide the type of guarantee XFS requires across
> > snapshot boundaries. I'm inclined to _not_ add dm-snap.c support
> > because it is best to just use dm-thin.
> >
> > And FYI even your dm-thin patch will be the starting point for the
> > dm-thin support (we'll keep attribution to you for all the code in a
> > separate patch).
> >
> > > Fair points, I certainly don't want to derail this conversation; I'd
> > > be happy to see this work merged sooner rather than later.
> >
> > Once those dm target changes are dropped I think the rest of the
> > series is fine to go upstream now. Feel free to post a v8.
>
> FYI, I've made my latest code available in this
> 'dm-6.5-provision-support' branch (based on 'dm-6.5'):
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-6.5-provision-support
>
> It's what v8 should be plus the 2 dm-thin patches (that I don't think
> should go upstream yet, but are theoretically useful for Dave and
> Joe).
>

Cheers! Apologies for dropping the ball on this, I just sent out v8
with the dm-thin patches dropped.

- Sarthak

> The "dm thin: complete interface for REQ_OP_PROVISION support" commit
> establishes all the dm-thin interface I think is needed. The FIXME in
> process_provision_bio() (and the patch header) cautions against upper
> layers like XFS using this dm-thinp support quite yet.
>
> Otherwise we'll have the issue where dm-thinp's REQ_OP_PROVISION
> support initially doesn't provide the guarantee that XFS needs across
> snapshots (which is: snapshots inherit all previous REQ_OP_PROVISION).
>
> Mike
>