Message ID | 20230929102726.2985188-18-john.g.garry@oracle.com
---|---
State | New, archived
Series | block atomic writes
On Fri, Sep 29, 2023 at 10:27:22AM +0000, John Garry wrote:
> Ensure that when creating a mapping we adhere to all the atomic
> write rules.
>
> We check that the mapping covers the complete range of the write to
> ensure that we'll be just creating a single mapping.
>
> Currently the minimum granularity is the FS block size, but it should
> be possible to support lower in future.

I really dislike how this forces aligned allocations.  Aligned
allocations are a nice optimization to offload some of the work
to the storage hardware/firmware, but we need to support atomic
writes in general.  And I think with out-of-place writes into the
COW fork, and atomic transactions to swap them in, we can do that
pretty easily.

That should also allow us to get rid of the horrible forcealign
mode, as we can still try to align if possible and just fall back
to the out-of-place writes.
On 09/11/2023 15:26, Christoph Hellwig wrote:
> On Fri, Sep 29, 2023 at 10:27:22AM +0000, John Garry wrote:
>> Ensure that when creating a mapping we adhere to all the atomic
>> write rules.
>>
>> We check that the mapping covers the complete range of the write to
>> ensure that we'll be just creating a single mapping.
>>
>> Currently the minimum granularity is the FS block size, but it should
>> be possible to support lower in future.
> I really dislike how this forces aligned allocations.  Aligned
> allocations are a nice optimization to offload some of the work
> to the storage hardware/firmware, but we need to support atomic
> writes in general.  And I think with out-of-place writes into the
> COW fork, and atomic transactions to swap them in, we can do that
> pretty easily.
>
> That should also allow us to get rid of the horrible forcealign
> mode, as we can still try to align if possible and just fall back
> to the out-of-place writes.

How could we try to align? Do you mean that we try to align up to some
stage in the block allocator search? That seems like some middle ground
between no alignment and forcealign. And what would we be aligning to?

Thanks,
John
Hi Christoph,

>>> Currently the minimum granularity is the FS block size, but it should
>>> be possible to support lower in future.
>> I really dislike how this forces aligned allocations.  Aligned
>> allocations are a nice optimization to offload some of the work
>> to the storage hardware/firmware, but we need to support atomic
>> writes in general.  And I think with out-of-place writes into the
>> COW fork, and atomic transactions to swap them in, we can do that
>> pretty easily.
>>
>> That should also allow us to get rid of the horrible forcealign
>> mode, as we can still try to align if possible and just fall back
>> to the out-of-place writes.
>>

Can you try to explain your idea a bit more? This is blocking us.

Are you suggesting some sort of hybrid between the atomic write series
you had a few years ago and this solution?

To me that would be continuing with the following:
- per-IO RWF_ATOMIC (and not O_ATOMIC semantics, where nothing is written
  until some data sync)
- writes must be a power-of-two in size and at a naturally-aligned offset
- relying on atomic write HW support always

But for extents which are misaligned, we CoW to a new extent? I suppose
we would align that extent to the alignment of the write (i.e. the length
of the write).

BTW, we also have rtvol support, which does not use forcealign as it can
already guarantee alignment, but it still relies on the same principle of
requiring alignment - would you want CoW support there also?

Thanks,
John
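[For illustration, a minimal sketch of what a per-IO atomic write under the
semantics listed above would look like from userspace, assuming the
RWF_ATOMIC pwritev2() flag proposed by this series; the flag value, file
name and 16 KiB write size are assumptions for the example, not part of
the series:]

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040	/* illustrative value only, not a settled ABI */
#endif

int main(void)
{
	const size_t len = 16 * 1024;	/* power-of-two write size */
	const off_t offset = 32 * 1024;	/* naturally aligned: offset % len == 0 */
	struct iovec iov;
	void *buf;
	int fd;

	fd = open("datafile", O_RDWR | O_DIRECT);
	if (fd < 0)
		return 1;

	if (posix_memalign(&buf, 4096, len))	/* O_DIRECT needs an aligned buffer */
		return 1;
	memset(buf, 0xab, len);

	iov.iov_base = buf;
	iov.iov_len = len;

	/* The whole 16 KiB is written atomically or not at all. */
	if (pwritev2(fd, &iov, 1, offset, RWF_ATOMIC) != (ssize_t)len)
		return 1;

	close(fd);
	free(buf);
	return 0;
}
```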
On Tue, Nov 28, 2023 at 08:56:37AM +0000, John Garry wrote:
> Are you suggesting some sort of hybrid between the atomic write series
> you had a few years ago and this solution?

Very roughly, yes.

> To me that would be continuing with the following:
> - per-IO RWF_ATOMIC (and not O_ATOMIC semantics, where nothing is written
>   until some data sync)

Yes.

> - writes must be a power-of-two in size and at a naturally-aligned offset

Where offset is the offset in the file? It would not be required. You
probably want to do it for optimal performance, but requiring it
feels rather limited.

> - relying on atomic write HW support always

And I think that's where we have different opinions. I think the hw
offload is a nice optimization and we should use it wherever we can.
But building the entire userspace API around it feels like a mistake.

> BTW, we also have rtvol support, which does not use forcealign as it can
> already guarantee alignment, but it still relies on the same principle of
> requiring alignment - would you want CoW support there also?

Upstream doesn't have out-of-place write support for the RT subvolume
yet. But Darrick has a series for it and we're actively working on
upstreaming it.
On 28/11/2023 13:56, Christoph Hellwig wrote:
> On Tue, Nov 28, 2023 at 08:56:37AM +0000, John Garry wrote:
>> Are you suggesting some sort of hybrid between the atomic write series
>> you had a few years ago and this solution?
> Very roughly, yes.
>
>> To me that would be continuing with the following:
>> - per-IO RWF_ATOMIC (and not O_ATOMIC semantics, where nothing is written
>>   until some data sync)
> Yes.
>
>> - writes must be a power-of-two in size and at a naturally-aligned offset
> Where offset is the offset in the file?

ok, fine, it would not be required for XFS with CoW. Some concerns still:
a. device atomic write boundary, if any
b. other FSes which do not have CoW support. ext4 is already being used
   for "atomic writes" in the field - see the dubious Amazon torn-write
   prevention.

About b., we could add the pow-of-2 and file offset alignment requirement
for other FSes, but then we would need to add some method to advertise
that restriction.

> It would not be required. You
> probably want to do it for optimal performance, but requiring it
> feels rather limited.
>
>> - relying on atomic write HW support always
> And I think that's where we have different opinions.

I'm just trying to understand your idea, and that is not necessarily my
final opinion.

> I think the hw
> offload is a nice optimization and we should use it wherever we can.

Sure, but to me it is a concern that we have 2x paths to make robust:
a. offload via HW, which may involve CoW
b. no HW support, i.e. CoW always

And for no HW support, if we don't follow the O_ATOMIC model of
committing nothing until a SYNC is issued, we would allocate, write, and
later free a new extent for each write, right?

> But building the entire userspace API around it feels like a mistake.
>

ok, but FWIW it works for the usecases which we know.

>> BTW, we also have rtvol support, which does not use forcealign as it can
>> already guarantee alignment, but it still relies on the same principle of
>> requiring alignment - would you want CoW support there also?
> Upstream doesn't have out-of-place write support for the RT subvolume
> yet. But Darrick has a series for it and we're actively working on
> upstreaming it.

Yeah, I thought that I heard this.

Thanks,
John
> b. other FSes which do not have CoW support. ext4 is already being
> used for "atomic writes" in the field

We also need raw block device access to work within the constraints
required by the hardware.

>> probably want to do it for optimal performance, but requiring it
>> feels rather limited.

The application developers we are working with generally prefer an
error when things are not aligned properly. Predictable performance is
key. Removing the performance variability of doing double writes is the
reason for supporting atomics in the first place.

I think there is value in providing a more generic (file-centric) atomic
user API. And I think the I/O stack plumbing we provide would be useful
in supporting such an endeavor. But I am not convinced that atomic
operations in general should be limited to the couple of filesystems
that can do CoW.
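[To make the "error when not aligned properly" behaviour concrete, a
sketch only (not code from the series) of the rule being discussed: a
write of length len at file offset off is accepted only if len is a
power of two within the advertised atomic limits and off is naturally
aligned to len; anything else is rejected rather than silently handled:]

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the alignment rule discussed above: reject any atomic
 * write whose length is not a power of two within [unit_min, unit_max]
 * or whose file offset is not naturally aligned to that length.
 */
static bool atomic_write_ok(uint64_t off, uint64_t len,
			    uint64_t unit_min, uint64_t unit_max)
{
	if (len < unit_min || len > unit_max)
		return false;
	if (len & (len - 1))		/* not a power of two */
		return false;
	if (off & (len - 1))		/* not naturally aligned */
		return false;
	return true;
}
```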
On Tue, Nov 28, 2023 at 05:42:10PM +0000, John Garry wrote:
> ok, fine, it would not be required for XFS with CoW. Some concerns still:
> a. device atomic write boundary, if any
> b. other FSes which do not have CoW support. ext4 is already being used
>    for "atomic writes" in the field - see the dubious Amazon torn-write
>    prevention.

What is the 'dubious amazon torn-write prevention'?

> About b., we could add the pow-of-2 and file offset alignment requirement
> for other FSes, but then we would need to add some method to advertise
> that restriction.

We really need a better way to communicate I/O limitations anyway.
Something like XFS_IOC_DIOINFO on steroids.

> Sure, but to me it is a concern that we have 2x paths to make robust:
> a. offload via HW, which may involve CoW
> b. no HW support, i.e. CoW always

Relying just on the hardware seems very limited, especially as there is
plenty of hardware that won't guarantee anything larger than 4k, and
plenty of NVMe hardware that has some other small limit like 32k because
it doesn't support the multiple atomicity mode.

> And for no HW support, if we don't follow the O_ATOMIC model of
> committing nothing until a SYNC is issued, we would allocate, write, and
> later free a new extent for each write, right?

Yes. Then again if you do data journalling you do that anyway, and as
one little project I'm doing right now shows, data journalling is often
the fastest thing we can do for very small writes.
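[A purely hypothetical sketch of what "XFS_IOC_DIOINFO on steroids" could
report. The existing XFS_IOC_DIOINFO returns struct dioattr (d_mem,
d_miniosz, d_maxiosz); the extended struct and field names below are
invented for illustration only and are not a real or proposed ABI:]

```c
#include <stdint.h>

/*
 * Hypothetical extended direct-I/O info: the classic dioattr fields
 * plus the atomic write limits a file can honour. Names and layout
 * are illustrative only.
 */
struct dioinfo_ext {
	uint32_t d_mem;				/* memory alignment for user buffers */
	uint32_t d_miniosz;			/* minimum direct I/O size */
	uint32_t d_maxiosz;			/* maximum direct I/O size */
	uint32_t d_atomic_write_unit_min;	/* smallest supported atomic write */
	uint32_t d_atomic_write_unit_max;	/* largest supported atomic write */
	uint32_t d_atomic_write_segments_max;	/* max iovecs per atomic write */
	uint32_t d_flags;			/* e.g. an "atomic writes supported" bit */
};
```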
On 04/12/2023 13:45, Christoph Hellwig wrote:
> On Tue, Nov 28, 2023 at 05:42:10PM +0000, John Garry wrote:
>> ok, fine, it would not be required for XFS with CoW. Some concerns still:
>> a. device atomic write boundary, if any
>> b. other FSes which do not have CoW support. ext4 is already being used
>>    for "atomic writes" in the field - see the dubious Amazon torn-write
>>    prevention.
>
> What is the 'dubious amazon torn-write prevention'?

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-twp.html

AFAICS, this is without any kernel changes, so there is no guarantee
against unwanted splitting or merging of bios.

Anyway, there will still be !CoW FSes which people want to support.

>> About b., we could add the pow-of-2 and file offset alignment requirement
>> for other FSes, but then we would need to add some method to advertise
>> that restriction.
>
> We really need a better way to communicate I/O limitations anyway.
> Something like XFS_IOC_DIOINFO on steroids.
>
>> Sure, but to me it is a concern that we have 2x paths to make robust:
>> a. offload via HW, which may involve CoW
>> b. no HW support, i.e. CoW always
>
> Relying just on the hardware seems very limited, especially as there is
> plenty of hardware that won't guarantee anything larger than 4k, and
> plenty of NVMe hardware that has some other small limit like 32k because
> it doesn't support the multiple atomicity mode.

So what would you propose as the next step? Would it be to first achieve
atomic write support for XFS with HW support + CoW to ensure contiguous
extents (and without XFS forcealign)?

>> And for no HW support, if we don't follow the O_ATOMIC model of
>> committing nothing until a SYNC is issued, we would allocate, write, and
>> later free a new extent for each write, right?
>
> Yes. Then again if you do data journalling you do that anyway, and as
> one little project I'm doing right now shows, data journalling is often
> the fastest thing we can do for very small writes.

Ignoring FSes, then how is this supposed to work for block devices? We
just always need HW support, right?

Thanks,
John
On Mon, Dec 04, 2023 at 03:19:15PM +0000, John Garry wrote:
> On 04/12/2023 13:45, Christoph Hellwig wrote:
>> On Tue, Nov 28, 2023 at 05:42:10PM +0000, John Garry wrote:
>>> ok, fine, it would not be required for XFS with CoW. Some concerns still:
>>> a. device atomic write boundary, if any
>>> b. other FSes which do not have CoW support. ext4 is already being used
>>>    for "atomic writes" in the field - see the dubious Amazon torn-write
>>>    prevention.
>>
>> What is the 'dubious amazon torn-write prevention'?
>
> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-twp.html
>
> AFAICS, this is without any kernel changes, so there is no guarantee
> against unwanted splitting or merging of bios.
>
> Anyway, there will still be !CoW FSes which people want to support.

Ugg, so they badly reimplement NVMe atomic write support and use it
without software stack enablement. Calling it dubious is way too
gentle..

>> Relying just on the hardware seems very limited, especially as there is
>> plenty of hardware that won't guarantee anything larger than 4k, and
>> plenty of NVMe hardware that has some other small limit like 32k because
>> it doesn't support the multiple atomicity mode.
>
> So what would you propose as the next step? Would it be to first achieve
> atomic write support for XFS with HW support + CoW to ensure contiguous
> extents (and without XFS forcealign)?

I think the very first priority is just block device support without
any fs enablement. We just need to make sure the API isn't too limited
for additional use cases.

> Ignoring FSes, then how is this supposed to work for block devices? We
> just always need HW support, right?

Yes.
On 04/12/2023 15:39, Christoph Hellwig wrote:
>> So what would you propose as the next step? Would it be to first achieve
>> atomic write support for XFS with HW support + CoW to ensure contiguous
>> extents (and without XFS forcealign)?
> I think the very first priority is just block device support without
> any fs enablement. We just need to make sure the API isn't too limited
> for additional use cases.

Sounds ok
On Mon, Dec 04, 2023 at 03:19:15PM +0000, John Garry wrote:
> >
> > What is the 'dubious amazon torn-write prevention'?
>
> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-twp.html
>
> AFAICS, this is without any kernel changes, so there is no guarantee
> against unwanted splitting or merging of bios.

Well, more than one company has audited the kernel paths, and it turns
out that for selected kernel versions, after doing desk-check
verification of the relevant kernel paths, as well as experimental
verification via testing to try to find torn writes in the kernel, we
can make it safe for specific kernel versions which might be used in
hosted MySQL instances where we control the kernel, the mysql server,
and the emulated block device (and we know the database is doing
Direct I/O writes --- this won't work for PostgreSQL). I gave a talk
about this at Google I/O Next '18, five years ago[1].

[1] https://www.youtube.com/watch?v=gIeuiGg-_iw

Given the performance gains (see the comparison in the talk at 19:31
and 29:57) --- it's quite compelling.

Of course, I wouldn't recommend this approach for a naive sysadmin,
since most database administrators won't know how to audit kernel code
(see the discussion at time 35:10 of the video) and reverify the entire
software stack before every kernel upgrade. The challenge is how to do
this safely.

The fact remains that both Amazon's EBS and Google's Persistent Disk
products are implemented in such a way that writes will not be torn
below the virtual machine, and the guarantees are in fact quite a bit
stronger than what we will probably end up advertising via NVMe and/or
SCSI. It wouldn't surprise me if this is the case (or could be made to
be the case) for Oracle Cloud as well.

The question is how to make this guarantee so that the kernel knows
when various cloud-provided block devices do provide these greater
guarantees, and then how to make it an architected feature, as opposed
to a happy implementation detail that has to be verified at every
kernel upgrade.

Cheers,

- Ted
On 05/12/2023 04:55, Theodore Ts'o wrote:
>> AFAICS, this is without any kernel changes, so there is no guarantee
>> against unwanted splitting or merging of bios.
> Well, more than one company has audited the kernel paths, and it turns
> out that for selected kernel versions, after doing desk-check
> verification of the relevant kernel paths, as well as experimental
> verification via testing to try to find torn writes in the kernel, we
> can make it safe for specific kernel versions which might be used in
> hosted MySQL instances where we control the kernel, the mysql server,
> and the emulated block device (and we know the database is doing
> Direct I/O writes --- this won't work for PostgreSQL). I gave a talk
> about this at Google I/O Next '18, five years ago[1].
>
> [1] https://www.youtube.com/watch?v=gIeuiGg-_iw
>
> Given the performance gains (see the comparison in the talk at 19:31
> and 29:57) --- it's quite compelling.
>
> Of course, I wouldn't recommend this approach for a naive sysadmin,
> since most database administrators won't know how to audit kernel code
> (see the discussion at time 35:10 of the video) and reverify the entire
> software stack before every kernel upgrade.

Sure

> The challenge is how to do this safely.

Right, and that is why I would be concerned about advertising torn-write
protection support when someone has not gone through the auditing and
verification effort to ensure that torn writes never happen in their
software stack.

> The fact remains that both Amazon's EBS and Google's Persistent Disk
> products are implemented in such a way that writes will not be torn
> below the virtual machine, and the guarantees are in fact quite a bit
> stronger than what we will probably end up advertising via NVMe and/or
> SCSI. It wouldn't surprise me if this is the case (or could be made to
> be the case) for Oracle Cloud as well.
>
> The question is how to make this guarantee so that the kernel knows
> when various cloud-provided block devices do provide these greater
> guarantees, and then how to make it an architected feature, as opposed
> to a happy implementation detail that has to be verified at every
> kernel upgrade.

The kernel can only judge atomic write support from what the HW product
data tells us, so cloud-provided block devices need to provide that
information as best they can if emulating some storage technology.

Thanks,
John
On Mon, Dec 04, 2023 at 03:19:15PM +0000, John Garry wrote:
> On 04/12/2023 13:45, Christoph Hellwig wrote:
> > On Tue, Nov 28, 2023 at 05:42:10PM +0000, John Garry wrote:
> > > ok, fine, it would not be required for XFS with CoW. Some concerns still:
> > > a. device atomic write boundary, if any
> > > b. other FSes which do not have CoW support. ext4 is already being used
> > >    for "atomic writes" in the field - see the dubious Amazon torn-write
> > >    prevention.
> >
> > What is the 'dubious amazon torn-write prevention'?
>
> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-twp.html
>
> AFAICS, this is without any kernel changes, so there is no guarantee
> against unwanted splitting or merging of bios.
>
> Anyway, there will still be !CoW FSes which people want to support.
>
> > > About b., we could add the pow-of-2 and file offset alignment requirement
> > > for other FSes, but then we would need to add some method to advertise
> > > that restriction.
> >
> > We really need a better way to communicate I/O limitations anyway.
> > Something like XFS_IOC_DIOINFO on steroids.
> >
> > > Sure, but to me it is a concern that we have 2x paths to make robust:
> > > a. offload via HW, which may involve CoW
> > > b. no HW support, i.e. CoW always
> >
> > Relying just on the hardware seems very limited, especially as there is
> > plenty of hardware that won't guarantee anything larger than 4k, and
> > plenty of NVMe hardware that has some other small limit like 32k because
> > it doesn't support the multiple atomicity mode.
>
> So what would you propose as the next step? Would it be to first achieve
> atomic write support for XFS with HW support + CoW to ensure contiguous
> extents (and without XFS forcealign)?
>
> > > And for no HW support, if we don't follow the O_ATOMIC model of
> > > committing nothing until a SYNC is issued, we would allocate, write, and
> > > later free a new extent for each write, right?
> >
> > Yes. Then again if you do data journalling you do that anyway, and as
> > one little project I'm doing right now shows, data journalling is often
> > the fastest thing we can do for very small writes.
>
> Ignoring FSes, then how is this supposed to work for block devices? We
> just always need HW support, right?

It looks like the HW support could be minimal, just like what Google and
Amazon did: a 16KB physical block size with proper queue limit settings.

Now it seems easy to make such a device with ublk-loop by:

- using one backing disk with a 16KB/32KB/... physical block size
- exposing proper physical block size, chunk_sectors and max sectors
  queue limits

Then any 16KB-aligned direct WRITE with a length of N*16KB (N in [1, 8]
with 256 chunk_sectors) can be an atomic write.

Thanks,
Ming
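[As an illustration of the kind of check Ming describes, a sketch only:
the queue limits he mentions are exported through sysfs, so userspace can
inspect what an emulated device advertises. The device name and the 16 KiB
unit are assumptions for the example; reading these attributes does not by
itself guarantee atomic behaviour:]

```c
#include <stdio.h>

/* Read one queue limit from /sys/block/<dev>/queue/<attr>, or -1 on error. */
static long read_limit(const char *dev, const char *attr)
{
	char path[256];
	long val = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	const char *dev = "ublkb0";	/* assumed device name */
	long pbs = read_limit(dev, "physical_block_size");	/* bytes */
	long chunk = read_limit(dev, "chunk_sectors");		/* 512B sectors */
	long max_kb = read_limit(dev, "max_sectors_kb");	/* KiB */

	printf("%s: physical_block_size=%ld chunk_sectors=%ld max_sectors_kb=%ld\n",
	       dev, pbs, chunk, max_kb);

	/*
	 * With physical_block_size=16384 and chunk_sectors=256 (128 KiB), a
	 * 16 KiB-aligned direct write of N*16 KiB (N in [1, 8]) never crosses
	 * a chunk boundary and never needs splitting by the block layer.
	 */
	return 0;
}
```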
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 70fe873951f3..3424fcfc04f5 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -783,6 +783,7 @@ xfs_direct_write_iomap_begin(
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_sb		*m_sb = &mp->m_sb;
 	struct xfs_bmbt_irec	imap, cmap;
 	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, length);
@@ -814,6 +815,41 @@ xfs_direct_write_iomap_begin(
 	if (error)
 		goto out_unlock;
 
+	if (flags & IOMAP_ATOMIC_WRITE) {
+		xfs_filblks_t unit_min_fsb, unit_max_fsb;
+
+		xfs_ip_atomic_write_attr(ip, &unit_min_fsb, &unit_max_fsb);
+
+		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
+			error = -EIO;
+			goto out_unlock;
+		}
+
+		if (offset % m_sb->sb_blocksize ||
+		    length % m_sb->sb_blocksize) {
+			error = -EIO;
+			goto out_unlock;
+		}
+
+		if (imap.br_blockcount == unit_min_fsb ||
+		    imap.br_blockcount == unit_max_fsb) {
+			/* min and max must be a power-of-2 */
+		} else if (imap.br_blockcount < unit_min_fsb ||
+			   imap.br_blockcount > unit_max_fsb) {
+			error = -EIO;
+			goto out_unlock;
+		} else if (!is_power_of_2(imap.br_blockcount)) {
+			error = -EIO;
+			goto out_unlock;
+		}
+
+		if (imap.br_startoff &&
+		    imap.br_startoff % imap.br_blockcount) {
+			error = -EIO;
+			goto out_unlock;
+		}
+	}
+
 	if (imap_needs_cow(ip, flags, &imap, nimaps)) {
 		error = -EAGAIN;
 		if (flags & IOMAP_NOWAIT)
Ensure that when creating a mapping we adhere to all the atomic
write rules.

We check that the mapping covers the complete range of the write to
ensure that we'll be just creating a single mapping.

Currently the minimum granularity is the FS block size, but it should
be possible to support lower in future.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_iomap.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)