Message ID | 20230929102726.2985188-11-john.g.garry@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | block atomic writes | expand |
On 9/29/23 03:27, John Garry wrote: > + if (pos % atomic_write_unit_min_bytes) > + return false; > + if (iov_iter_count(iter) % atomic_write_unit_min_bytes) > + return false; > + if (!is_power_of_2(iov_iter_count(iter))) > + return false; [ ... ] > + if (pos % iov_iter_count(iter)) > + return false; Where do these rules come from? Is there any standard that requires any of the above? Thanks, Bart.
On 29/09/2023 18:51, Bart Van Assche wrote: > On 9/29/23 03:27, John Garry wrote: >> + if (pos % atomic_write_unit_min_bytes) >> + return false; >> + if (iov_iter_count(iter) % atomic_write_unit_min_bytes) >> + return false; >> + if (!is_power_of_2(iov_iter_count(iter))) >> + return false; > [ ... ] >> + if (pos % iov_iter_count(iter)) >> + return false; > > Where do these rules come from? Is there any standard that requires > any of the above? SCSI and NVMe have slightly different atomic writes semantics, and the rules are created to work for both. In addition, the rules are related to FS extent alignment. Note that for simplicity and consistency we use the same rules for regular files as for bdev's. This is the coding for the rules and where they come from: > + if (!atomic_write_unit_min_bytes) > + return false; If atomic_write_unit_min_bytes == 0, then we just don't support atomic writes. > + if (pos % atomic_write_unit_min_bytes) > + return false; See later rules. > + if (iov_iter_count(iter) % atomic_write_unit_min_bytes) > + return false; For SCSI, there is an atomic write granularity, which dictates atomic_write_unit_min_bytes. So here we need to ensure that the length is a multiple of this value. > + if (!is_power_of_2(iov_iter_count(iter))) > + return false; This rule comes from FS block alignment and NVMe atomic boundary. FSes (XFS) have discontiguous extents. We need to ensure that an atomic write does not cross discontiguous extents. To do this we ensure extent length and alignment and limit atomic_write_unit_max_bytes to that. For NVMe, an atomic write boundary is a boundary in LBA space which an atomic write should not cross. We limit atomic_write_unit_max_bytes such that it is evenly divisible into this atomic write boundary. To ensure that the write does not cross these alignment boundaries we say that it must be naturally aligned and a power-of-2 in length. We may be able to relax this rule but I am not sure it buys us anything - typically we want to be writing a 64KB block aligned to 64KB, for example. > + if (iov_iter_count(iter) > atomic_write_unit_max_bytes) > + return false; We just can't exceed this length. > + if (pos % iov_iter_count(iter)) > + return false; As above, ensure naturally aligned. Thanks, John
On 10/2/23 03:10, John Garry wrote: > On 29/09/2023 18:51, Bart Van Assche wrote: >> On 9/29/23 03:27, John Garry wrote: > > + if (pos % atomic_write_unit_min_bytes) > > + return false; > > See later rules. Is atomic_write_unit_min_bytes always equal to the logical block size? If so, can the above test be left out? > > + if (iov_iter_count(iter) % atomic_write_unit_min_bytes) > > + return false; > > For SCSI, there is an atomic write granularity, which dictates > atomic_write_unit_min_bytes. So here we need to ensure that the length > is a multiple of this value. Are there any SCSI devices that we care about that report an ATOMIC TRANSFER LENGTH GRANULARITY that is larger than a single logical block? I'm wondering whether we really have to support such devices. > > + if (!is_power_of_2(iov_iter_count(iter))) > > + return false; > > This rule comes from FS block alignment and NVMe atomic boundary. > > FSes (XFS) have discontiguous extents. We need to ensure that an atomic > write does not cross discontiguous extents. To do this we ensure extent > length and alignment and limit atomic_write_unit_max_bytes to that. > > For NVMe, an atomic write boundary is a boundary in LBA space which an > atomic write should not cross. We limit atomic_write_unit_max_bytes such > that it is evenly divisible into this atomic write boundary. > > To ensure that the write does not cross these alignment boundaries we > say that it must be naturally aligned and a power-of-2 in length. > > We may be able to relax this rule but I am not sure it buys us anything > - typically we want to be writing a 64KB block aligned to 64KB, for > example. It seems to me that the requirement is_power_of_2(iov_iter_count(iter)) is necessary for some filesystems but not for all filesystems. Restrictions that are specific to a single filesystem (XFS) should not occur in code that is intended to be used by all filesystems (blkdev_atomic_write_valid()). Thanks, Bart.
Bart, > Are there any SCSI devices that we care about that report an ATOMIC > TRANSFER LENGTH GRANULARITY that is larger than a single logical > block? Yes. Note that code path used inside a storage device to guarantee atomicity of an entire I/O may be substantially different from the code path which only offers an incremental guarantee at a single logical or physical block level (to the extent that those guarantees are offered at all but that's a different kettle of fish). > I'm wondering whether we really have to support such devices. Yes.
On 02/10/2023 20:12, Bart Van Assche wrote: >> > + if (!is_power_of_2(iov_iter_count(iter))) >> > + return false; >> >> This rule comes from FS block alignment and NVMe atomic boundary. >> >> FSes (XFS) have discontiguous extents. We need to ensure that an >> atomic write does not cross discontiguous extents. To do this we >> ensure extent length and alignment and limit >> atomic_write_unit_max_bytes to that. >> >> For NVMe, an atomic write boundary is a boundary in LBA space which an >> atomic write should not cross. We limit atomic_write_unit_max_bytes >> such that it is evenly divisible into this atomic write boundary. >> >> To ensure that the write does not cross these alignment boundaries we >> say that it must be naturally aligned and a power-of-2 in length. >> >> We may be able to relax this rule but I am not sure it buys us >> anything - typically we want to be writing a 64KB block aligned to >> 64KB, for example. > > It seems to me that the requirement is_power_of_2(iov_iter_count(iter)) > is necessary for some filesystems but not for all filesystems. > Restrictions that are specific to a single filesystem (XFS) should not > occur in code that is intended to be used by all filesystems > (blkdev_atomic_write_valid()). I don't think that is_power_of_2(write length) is specific to XFS. It is just a simple mathematical method to ensure we obey length and alignment requirement always. Furthermore, if ext4 wants to support atomic writes, for example, then it will probably base that on bigalloc. And bigalloc is power-of-2 based. As for the rules, current proposal is: - atomic_write_unit_min and atomic_write_unit_max are power-of-2 - write needs to be at a naturally aligned file offset - write length needs to be a power-of-2 between atomic_write_unit_min and atomic_write_unit_max, inclusive Those could be relaxed to: - atomic_write_unit_min and atomic_write_unit_max are power-of-2 - write length needs to be a multiple of atomic_write_unit_min and a max of atomic_write_unit_max - write needs to be at an offset aligned to atomic_write_unit_min - write cannot cross atomic_write_unit_max boundary within the file Are the relaxed rules better? I don't think so, and I don't like "write cannot cross atomic_write_unit_max boundary" in terms of wording. Thanks, John
On 10/3/23 01:37, John Garry wrote:
> I don't think that is_power_of_2(write length) is specific to XFS.
I think this is specific to XFS. Can you show me the F2FS code that
restricts the length of an atomic write to a power of two? I haven't
found it. The only power-of-two check that I found in F2FS is the
following (maybe I overlooked something):
$ git grep -nH is_power fs/f2fs
fs/f2fs/super.c:3914: if (!is_power_of_2(zone_sectors)) {
Thanks,
Bart.
On 10/2/23 17:48, Martin K. Petersen wrote: > > Bart, > >> Are there any SCSI devices that we care about that report an ATOMIC >> TRANSFER LENGTH GRANULARITY that is larger than a single logical >> block? > > Yes. > > Note that code path used inside a storage device to guarantee atomicity > of an entire I/O may be substantially different from the code path which > only offers an incremental guarantee at a single logical or physical > block level (to the extent that those guarantees are offered at all but > that's a different kettle of fish). > >> I'm wondering whether we really have to support such devices. > > Yes. Hi Martin, I'm still wondering whether we really should support storage devices that report an ATOMIC TRANSFER LENGTH GRANULARITY that is larger than the logical block size. Is my understanding correct that the NVMe specification makes it mandatory to support single logical block atomic writes since the smallest value that can be reported as the AWUN parameter is one logical block because this parameter is a 0's based value? Is my understanding correct that SCSI devices that report an ATOMIC TRANSFER LENGTH GRANULARITY that is larger than the logical block size are not able to support the NVMe protocol? From the NVMe specification section about the identify controller response: "Atomic Write Unit Normal (AWUN): This field indicates the size of the write operation guaranteed to be written atomically to the NVM across all namespaces with any supported namespace format during normal operation. This field is specified in logical blocks and is a 0’s based value." Thanks, Bart.
Bart, > I'm still wondering whether we really should support storage devices > that report an ATOMIC TRANSFER LENGTH GRANULARITY that is larger than > the logical block size. We should. The common case is that the device reports an ATOMIC TRANSFER LENGTH GRANULARITY matching the reported physical block size. I.e. a logical block size of 512 bytes and a physical block size of 4KB. In that scenario a write of a single logical block would require read-modify-write of a physical block. > Is my understanding correct that the NVMe specification makes it > mandatory to support single logical block atomic writes since the > smallest value that can be reported as the AWUN parameter is one > logical block because this parameter is a 0's based value? Is my > understanding correct that SCSI devices that report an ATOMIC TRANSFER > LENGTH GRANULARITY that is larger than the logical block size are not > able to support the NVMe protocol? That's correct. There are obviously things you can express in SCSI that you can't in NVMe. And the other way around. Our intent is to support both protocols.
On 03/10/2023 17:45, Bart Van Assche wrote: > On 10/3/23 01:37, John Garry wrote: >> I don't think that is_power_of_2(write length) is specific to XFS. > > I think this is specific to XFS. Can you show me the F2FS code that > restricts the length of an atomic write to a power of two? I haven't > found it. The only power-of-two check that I found in F2FS is the > following (maybe I overlooked something): > > $ git grep -nH is_power fs/f2fs > fs/f2fs/super.c:3914: if (!is_power_of_2(zone_sectors)) { Any usecases which we know of requires a power-of-2 block size. Do you know of a requirement for other sizes? Or are you concerned that it is unnecessarily restrictive? We have to deal with HW features like atomic write boundary and FS restrictions like extent and stripe alignment transparent, which are almost always powers-of-2, so naturally we would want to work with powers-of-2 for atomic write sizes. The power-of-2 stuff could be dropped if that is what people want. However we still want to provide a set of rules to the user to make those HW and FS features mentioned transparent to the user. Thanks, John
On 10/3/23 19:53, Martin K. Petersen wrote: > > Bart, > >> I'm still wondering whether we really should support storage >> devices that report an ATOMIC TRANSFER LENGTH GRANULARITY that is >> larger than the logical block size. > > We should. The common case is that the device reports an ATOMIC > TRANSFER LENGTH GRANULARITY matching the reported physical block > size. I.e. a logical block size of 512 bytes and a physical block > size of 4KB. In that scenario a write of a single logical block would > require read-modify-write of a physical block. Block devices must serialize read-modify-write operations internally that happen when there are multiple logical blocks per physical block. Otherwise it is not guaranteed that a READ command returns the most recently written data to the same LBA. I think we can ignore concurrent and overlapping writes in this discussion since these can be considered as bugs in host software. In other words, also for the above example it is guaranteed that writes of a single logical block (512 bytes) are atomic, no matter what value is reported as the ATOMIC TRANSFER LENGTH GRANULARITY. >> Is my understanding correct that the NVMe specification makes it >> mandatory to support single logical block atomic writes since the >> smallest value that can be reported as the AWUN parameter is one >> logical block because this parameter is a 0's based value? Is my >> understanding correct that SCSI devices that report an ATOMIC >> TRANSFER LENGTH GRANULARITY that is larger than the logical block >> size are not able to support the NVMe protocol? > > That's correct. There are obviously things you can express in SCSI > that you can't in NVMe. And the other way around. Our intent is to > support both protocols. How about aligning the features of the two protocols as much as possible? My understanding is that all long-term T10 contributors are all in favor of this. Thanks, Bart.
On 10/4/23 02:14, John Garry wrote: > On 03/10/2023 17:45, Bart Van Assche wrote: >> On 10/3/23 01:37, John Garry wrote: >>> I don't think that is_power_of_2(write length) is specific to XFS. >> >> I think this is specific to XFS. Can you show me the F2FS code that >> restricts the length of an atomic write to a power of two? I haven't >> found it. The only power-of-two check that I found in F2FS is the >> following (maybe I overlooked something): >> >> $ git grep -nH is_power fs/f2fs >> fs/f2fs/super.c:3914: if (!is_power_of_2(zone_sectors)) { > > Any usecases which we know of requires a power-of-2 block size. > > Do you know of a requirement for other sizes? Or are you concerned that > it is unnecessarily restrictive? > > We have to deal with HW features like atomic write boundary and FS > restrictions like extent and stripe alignment transparent, which are > almost always powers-of-2, so naturally we would want to work with > powers-of-2 for atomic write sizes. > > The power-of-2 stuff could be dropped if that is what people want. > However we still want to provide a set of rules to the user to make > those HW and FS features mentioned transparent to the user. Hi John, My concern is that the power-of-2 requirements are only needed for traditional filesystems and not for log-structured filesystems (BTRFS, F2FS, BCACHEFS). What I'd like to see is that each filesystem declares its atomic write requirements (in struct address_space_operations?) and that blkdev_atomic_write_valid() checks the filesystem-specific atomic write requirements. Thanks, Bart.
Hi Bart! > In other words, also for the above example it is guaranteed that > writes of a single logical block (512 bytes) are atomic, no matter > what value is reported as the ATOMIC TRANSFER LENGTH GRANULARITY. There is no formal guarantee that a disk drive sector read-modify-write operation results in a readable sector after a power failure. We have definitely seen blocks being mangled in the field. Wrt. supporting SCSI atomic block operations, the device rejects the WRITE ATOMIC(16) command if you attempt to use a transfer length smaller than the reported granularity. If we want to support WRITE ATOMIC(16) we have to abide by the values reported by the device. It is not optional. Besides, the whole point of this patch set is to increase the "observable atomic block size" beyond the physical block size to facilitate applications that prefer to use blocks in the 8-64KB range. IOW, using the logical block size is not particularly interesting. The objective is to prevent tearing of much larger blocks. > How about aligning the features of the two protocols as much as > possible? My understanding is that all long-term T10 contributors are > all in favor of this. That is exactly what this patch set does. Out of the 5-6 different "atomic" modes of operation permitted by SCSI and NVMe, our exposed semantics are carefully chosen to permit all compliant devices to be used. Based on only two reported queue limits (FWIW, we started with way more than that. I believe that complexity was part of the first RFC we posted). Whereas this series hides most of the complexity in the various unfortunate protocol quirks behind a simple interface: Your tear-proof writes can't be smaller than X bytes and larger than Y bytes and they must be naturally aligned. This simplified things substantially from an application perspective.
On Wed, Oct 04, 2023 at 10:34:13AM -0700, Bart Van Assche wrote: > On 10/4/23 02:14, John Garry wrote: > > On 03/10/2023 17:45, Bart Van Assche wrote: > > > On 10/3/23 01:37, John Garry wrote: > > > > I don't think that is_power_of_2(write length) is specific to XFS. > > > > > > I think this is specific to XFS. Can you show me the F2FS code that > > > restricts the length of an atomic write to a power of two? I haven't > > > found it. The only power-of-two check that I found in F2FS is the > > > following (maybe I overlooked something): > > > > > > $ git grep -nH is_power fs/f2fs > > > fs/f2fs/super.c:3914: if (!is_power_of_2(zone_sectors)) { > > > > Any usecases which we know of requires a power-of-2 block size. > > > > Do you know of a requirement for other sizes? Or are you concerned that > > it is unnecessarily restrictive? > > > > We have to deal with HW features like atomic write boundary and FS > > restrictions like extent and stripe alignment transparent, which are > > almost always powers-of-2, so naturally we would want to work with > > powers-of-2 for atomic write sizes. > > > > The power-of-2 stuff could be dropped if that is what people want. > > However we still want to provide a set of rules to the user to make > > those HW and FS features mentioned transparent to the user. > > Hi John, > > My concern is that the power-of-2 requirements are only needed for > traditional filesystems and not for log-structured filesystems (BTRFS, > F2FS, BCACHEFS). Filesystems that support copy-on-write data (needed for arbitrary filesystem block aligned RWF_ATOMIC support) are not necessarily log structured. For example: XFS. All three of the filesystems you list above still use power-of-2 block sizes for most of their metadata structures and for large data extents. Hence once you go above a certain file size they are going to be doing full power-of-2 block size aligned IO anyway. hence the constraint of atomic writes needing to be power-of-2 block size aligned to avoid RMW cycles doesn't really change for these filesystems. In which case, they can just set their minimum atomic IO size to be the same as their block size (e.g. 4kB) and set the maximum to something they can guarantee gets COW'd in a single atomic transaction. What the hardware can do with REQ_ATOMIC IO is completely irrelevant at this point.... > What I'd like to see is that each filesystem declares its atomic write > requirements (in struct address_space_operations?) and that > blkdev_atomic_write_valid() checks the filesystem-specific atomic write > requirements. That seems unworkable to me - IO constraints propagate from the bottom up, not from the top down. Consider multi-device filesystems (btrfs and XFS), where different devices might have different atomic write parameters. Which set of bdev parameters does the filesystem report to the querying bdev? (And doesn't that question just sound completely wrong?) It also doesn't work for filesystems that can configure extent allocation alignment at an individual inode level (like XFS) - what does the filesystem report to the device when it doesn't know what alignment constraints individual on-disk inodes might be using? That's why statx() vectors through filesystems to all them to set their own parameters based on the inode statx() is being called on. If the filesystem has a native RWF_ATOMIC implementation, it can put it's own parameters in the statx min/max atomic write size fields. If the fs doesn't have it's own native support, but can do physical file offset/LBA alignment, then it publishes the block device atomic support parameters or overrides them with it's internal allocation alignment constraints. If the bdev doesn't support REQ_ATOMIC, the filesystem says "atomic writes are not supported". -Dave.
On 10/4/23 11:17, Martin K. Petersen wrote: > > Hi Bart! > >> In other words, also for the above example it is guaranteed that >> writes of a single logical block (512 bytes) are atomic, no matter >> what value is reported as the ATOMIC TRANSFER LENGTH GRANULARITY. > > There is no formal guarantee that a disk drive sector > read-modify-write operation results in a readable sector after a > power failure. We have definitely seen blocks being mangled in the > field. Aren't block devices expected to use a capacitor that provides enough power to handle power failures cleanly? How about blacklisting block devices that mangle blocks if a power failure occurs? I think such block devices are not compatible with journaling filesystems nor with log-structured filesystems. Thanks, Bart.
On Thu, Oct 05, 2023 at 10:10:45AM -0700, Bart Van Assche wrote: > On 10/4/23 11:17, Martin K. Petersen wrote: > > > > Hi Bart! > > > > > In other words, also for the above example it is guaranteed that > > > writes of a single logical block (512 bytes) are atomic, no matter > > > what value is reported as the ATOMIC TRANSFER LENGTH GRANULARITY. > > > > There is no formal guarantee that a disk drive sector read-modify-write > > operation results in a readable sector after a power failure. We have > > definitely seen blocks being mangled in the field. > > Aren't block devices expected to use a capacitor that provides enough > power to handle power failures cleanly? Nope. Any block device that says it operates in writeback cache mode (i.e. almost every single consumer SATA and NVMe drive ever made) has a volatile write back cache and so does not provide any power fail data integrity guarantees. Simple to check, my less-than-1-yr-old workstation tells me: $ lspci |grep -i nvme 03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 06:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 $ cat /sys/block/nvme*n1/queue/write_cache write back write back $ That they have volatile writeback caches.... > How about blacklisting block devices that mangle blocks if a power > failure occurs? I think such block devices are not compatible with > journaling filesystems nor with log-structured filesystems. Statements like this from people working on storage hardware really worry me. It demonstrates a lack of understanding of how filesystems actually work, not to mention the fact that this architectural problem (i.e. handling volatile device write caches correctly) was solved in the Linux IO stack a couple of decades ago. This isn't even 'state of the art' knowledge - this is foundational knowlege that everyone working on storage should know. The tl;dr summary is that filesystems will issue a cache flush request (REQ_PREFLUSH) and/or write-through to stable storage semantics (REQ_FUA) for any data, metadata or journal IO that has data integrity and/or ordering requirements associated with it. The block layer will then do the most optimal correct thing with that request (e.g. ignore them for IO being directed at WC disabled devices), but it guarantees the flush/fua semantics for those IOs will be provided by all layers in the stack right down to the persistent storage media itself. Hence all the filesystem has to do is get it's IO and cache flush ordering correct, and everything just works regardless of the underlying storage capabilities. And, yes, any storage device with volatile caches that doesn't implement cache flushes correctly is considered broken and will get black listed.... -Dave.
On 10/5/23 15:36, Dave Chinner wrote: > $ lspci |grep -i nvme > 03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 > 06:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 > $ cat /sys/block/nvme*n1/queue/write_cache > write back > write back > $ > > That they have volatile writeback caches.... It seems like what I wrote has been misunderstood completely. With "handling a power failure cleanly" I meant that power cycling a block device does not result in read errors nor in reading data that has never been written. Although it is hard to find information about this topic, here is what I found online: * About certain SSDs with power loss protection: https://us.transcend-info.com/embedded/technology/power-loss-protection-plp * About another class of SSDs with power loss protection: https://www.kingston.com/en/blog/servers-and-data-centers/ssd-power-loss-protection * About yet another class of SSDs with power loss protection: https://phisonblog.com/avoiding-ssd-data-loss-with-phisons-power-loss-protection-2/ So far I haven't found any information about hard disks and power failure handling. What I found is that most current hard disks protect data with ECC. The ECC mechanism should provide good protection against reading data that has never been written. If a power failure occurs while a hard disk is writing a physical block, can this result in a read error after power is restored? If so, is this behavior allowed by storage standards? Bart.
On Thu, Oct 05, 2023 at 03:58:38PM -0700, Bart Van Assche wrote: > On 10/5/23 15:36, Dave Chinner wrote: > > $ lspci |grep -i nvme > > 03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 > > 06:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 > > $ cat /sys/block/nvme*n1/queue/write_cache > > write back > > write back > > $ > > > > That they have volatile writeback caches.... > > It seems like what I wrote has been misunderstood completely. With > "handling a power failure cleanly" I meant that power cycling a block device > does not result in read errors nor in reading data that has never been written. Then I don't see what your concern is. Single sector writes are guaranteed atomic and have been for as long as I've worked in this game. OTOH, multi-sector writes are not guaranteed to be atomic - they can get torn on sector boundaries, but the individual sectors within that write are guaranteed to be all-or-nothing. Any hardware device that does not guarantee single sector write atomicity (i.e. tears in the middle of a sector) is, by definition, broken. And we all know that broken hardware means nothing in the storage stack works as it should, so I just don't see what point you are trying to make... > Although it is hard to find information about this topic, here is what I found > online: > * About certain SSDs with power loss protection: > https://us.transcend-info.com/embedded/technology/power-loss-protection-plp > * About another class of SSDs with power loss protection: > https://www.kingston.com/en/blog/servers-and-data-centers/ssd-power-loss-protection > * About yet another class of SSDs with power loss protection: > https://phisonblog.com/avoiding-ssd-data-loss-with-phisons-power-loss-protection-2/ Yup, devices that behave as if they have non-volatile write caches. Such devices have been around for more than 30 years, they operate the same as devices without caches at all. > So far I haven't found any information about hard disks and power failure > handling. What I found is that most current hard disks protect data with ECC. > The ECC mechanism should provide good protection against reading data that > has never been written. If a power failure occurs while a hard disk is writing > a physical block, can this result in a read error after power is restored? If > so, is this behavior allowed by storage standards? If a power fail results in read errors from the storage media being reported to the OS instead of the data that was present in the sector before the power failure, then the device is broken. If there is no data in the region being read because it has never been written, then it should return zeros (no data) rather than stale data or a media read error. -Dave.
On 10/5/23 21:31, Dave Chinner wrote: > Then I don't see what your concern is. > > Single sector writes are guaranteed atomic and have been for as long > as I've worked in this game. OTOH, multi-sector writes are not > guaranteed to be atomic - they can get torn on sector boundaries, but > the individual sectors within that write are guaranteed to be > all-or-nothing. > > Any hardware device that does not guarantee single sector write > atomicity (i.e. tears in the middle of a sector) is, by definition, > broken. And we all know that broken hardware means nothing in the > storage stack works as it should, so I just don't see what point you > are trying to make... Do you agree that the above implies that it is not useful in patch 01/21 of this series to track atomic_write_unit_min_bytes in the block layer nor to export this information to user space? The above implies that this parameter will always be equal to the logical block size. Writes to a single physical block happen atomically. If there are multiple logical blocks per physical block, the block device must serialize read/modify/write cycles internally. Thanks, Bart.
Bart, > The above implies that this parameter will always be equal to the > logical block size. It does not. Being able to write each individual block in an I/O without tearing does not imply that a device can write two blocks as a single atomic operation. > Writes to a single physical block happen atomically. If there are > multiple logical blocks per physical block, the block device must > serialize read/modify/write cycles internally. This is what SBC has to say: "If any write command that is not an atomic write command, does not complete successfully (e.g., the command completed with CHECK CONDITION status, or the command was being processed at the time of a power loss or an incorrect demount of a removable medium), then any data in the logical blocks referenced by the LBAs specified by that command is indeterminate." SBC defines "atomic write command" like this: "An atomic write command performs one or more atomic write operations. The following write commands are atomic write commands: a) WRITE ATOMIC (16) (see 5.48); and b) WRITE ATOMIC (32) (see 5.49)." You will note that none of the regular WRITE commands appear in that list. Now, in practice we obviously rely heavily on the fact that most devices are implemented in a sane fashion which doesn't mess up individual logical blocks on power fail. But the spec does not guarantee this; it is device implementation dependent. And again, we have seen both hard disk drives and SSDs that cause collateral damage to an entire physical block when power is lost at the wrong time.
On Fri, Sep 29, 2023 at 10:27:15AM +0000, John Garry wrote: > Add support for atomic writes, as follows: > - Ensure that the IO follows all the atomic writes rules, like must be > naturally aligned > - Set REQ_ATOMIC > > Signed-off-by: John Garry <john.g.garry@oracle.com> > --- > block/fops.c | 42 +++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 41 insertions(+), 1 deletion(-) > > diff --git a/block/fops.c b/block/fops.c > index acff3d5d22d4..516669ad69e5 100644 > --- a/block/fops.c > +++ b/block/fops.c > @@ -41,6 +41,29 @@ static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos, > !bdev_iter_is_aligned(bdev, iter); > } > > +static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos, > + struct iov_iter *iter) > +{ > + unsigned int atomic_write_unit_min_bytes = > + queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev)); > + unsigned int atomic_write_unit_max_bytes = > + queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev)); > + > + if (!atomic_write_unit_min_bytes) > + return false; The above check should have be moved to limit setting code path. > + if (pos % atomic_write_unit_min_bytes) > + return false; > + if (iov_iter_count(iter) % atomic_write_unit_min_bytes) > + return false; > + if (!is_power_of_2(iov_iter_count(iter))) > + return false; > + if (iov_iter_count(iter) > atomic_write_unit_max_bytes) > + return false; > + if (pos % iov_iter_count(iter)) > + return false; I am a bit confused about relation between atomic_write_unit_max_bytes and atomic_write_max_bytes. Here the max IO length is limited to be <= atomic_write_unit_max_bytes, so looks userspace can only submit IO with write-atomic-unit naturally aligned IO(such as, 4k, 8k, 16k, 32k, ...), but these user IOs are allowed to be merged to big one if naturally alignment is respected and the merged IO size is <= atomic_write_max_bytes. Is my understanding right? If yes, I'd suggest to document the point, and the last two checks could be change to: /* naturally aligned */ if (pos % iov_iter_count(iter)) return false; if (iov_iter_count(iter) > atomic_write_max_bytes) return false; Thanks, Ming
On 04/12/2023 02:30, Ming Lei wrote: Hi Ming, >> +static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos, >> + struct iov_iter *iter) >> +{ >> + unsigned int atomic_write_unit_min_bytes = >> + queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev)); >> + unsigned int atomic_write_unit_max_bytes = >> + queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev)); >> + >> + if (!atomic_write_unit_min_bytes) >> + return false; > The above check should have be moved to limit setting code path. Sorry, I didn't fully understand your point. I added this here (as opposed to the caller), as I was not really worried about speeding up the failure path. Are you saying to call even earlier in submission path? > >> + if (pos % atomic_write_unit_min_bytes) >> + return false; >> + if (iov_iter_count(iter) % atomic_write_unit_min_bytes) >> + return false; >> + if (!is_power_of_2(iov_iter_count(iter))) >> + return false; >> + if (iov_iter_count(iter) > atomic_write_unit_max_bytes) >> + return false; >> + if (pos % iov_iter_count(iter)) >> + return false; > I am a bit confused about relation between atomic_write_unit_max_bytes and > atomic_write_max_bytes. I think that naming could be improved. Or even just drop merging (and atomic_write_max_bytes concept) until we show it to improve performance. So generally atomic_write_unit_max_bytes will be same as atomic_write_max_bytes, however it could be different if: a. request queue nr hw segments or other request queue limits needs to restrict atomic_write_unit_max_bytes b. atomic_write_unit_max_bytes does not need to be a power-of-2 and atomic_write_max_bytes does. So essentially: atomic_write_unit_max_bytes = rounddown_pow_of_2(atomic_write_max_bytes) > > Here the max IO length is limited to be <= atomic_write_unit_max_bytes, > so looks userspace can only submit IO with write-atomic-unit naturally > aligned IO(such as, 4k, 8k, 16k, 32k, ...), correct > but these user IOs are > allowed to be merged to big one if naturally alignment is respected and > the merged IO size is <= atomic_write_max_bytes. correct, but the resultant merged IO does not have have to be naturally aligned. > > Is my understanding right? Yes, but... > If yes, I'd suggest to document the point, > and the last two checks could be change to: > > /* naturally aligned */ > if (pos % iov_iter_count(iter)) > return false; > > if (iov_iter_count(iter) > atomic_write_max_bytes) > return false; .. we would not be merging at this point as this is just IO submission to the block layer, so atomic_write_max_bytes does not come into play yet. If you check patch 7/21, you will see that we limit IO size to atomic_write_max_bytes, which is relevant merging. Thanks, John
On Mon, Dec 04, 2023 at 09:27:00AM +0000, John Garry wrote: > On 04/12/2023 02:30, Ming Lei wrote: > > Hi Ming, > > > > +static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos, > > > + struct iov_iter *iter) > > > +{ > > > + unsigned int atomic_write_unit_min_bytes = > > > + queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev)); > > > + unsigned int atomic_write_unit_max_bytes = > > > + queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev)); > > > + > > > + if (!atomic_write_unit_min_bytes) > > > + return false; > > The above check should have be moved to limit setting code path. > > Sorry, I didn't fully understand your point. > > I added this here (as opposed to the caller), as I was not really worried > about speeding up the failure path. Are you saying to call even earlier in > submission path? atomic_write_unit_min is one hardware property, and it should be checked in blk_queue_atomic_write_unit_min_sectors() from beginning, then you can avoid this check every other where. > > > > > > + if (pos % atomic_write_unit_min_bytes) > > > + return false; > > > + if (iov_iter_count(iter) % atomic_write_unit_min_bytes) > > > + return false; > > > + if (!is_power_of_2(iov_iter_count(iter))) > > > + return false; > > > + if (iov_iter_count(iter) > atomic_write_unit_max_bytes) > > > + return false; > > > + if (pos % iov_iter_count(iter)) > > > + return false; > > I am a bit confused about relation between atomic_write_unit_max_bytes and > > atomic_write_max_bytes. > > I think that naming could be improved. Or even just drop merging (and > atomic_write_max_bytes concept) until we show it to improve performance. > > So generally atomic_write_unit_max_bytes will be same as > atomic_write_max_bytes, however it could be different if: > a. request queue nr hw segments or other request queue limits needs to > restrict atomic_write_unit_max_bytes > b. atomic_write_unit_max_bytes does not need to be a power-of-2 and > atomic_write_max_bytes does. So essentially: > atomic_write_unit_max_bytes = rounddown_pow_of_2(atomic_write_max_bytes) > plug merge often improves sequential IO perf, so if the hardware supports this way, I think 'atomic_write_max_bytes' should be supported from the beginning, such as: - user space submits sequential N * (4k, 8k, 16k, ...) atomic writes, all can be merged to single IO request, which is issued to driver. Or - user space submits sequential 4k, 4k, 8k, 16K, 32k, 64k atomic writes, all can be merged to single IO request, which is issued to driver. The hardware should recognize unit size by start LBA, and check if length is valid, so probably the interface might be relaxed to: 1) start lba is unit aligned, and this unit is in the supported unit range(power_2 in [unit_min, unit_max]) 2) length needs to be: - N * this_unit_size - <= atomic_write_max_bytes > > > > Here the max IO length is limited to be <= atomic_write_unit_max_bytes, > > so looks userspace can only submit IO with write-atomic-unit naturally > > aligned IO(such as, 4k, 8k, 16k, 32k, ...), > > correct > > > but these user IOs are > > allowed to be merged to big one if naturally alignment is respected and > > the merged IO size is <= atomic_write_max_bytes. > > correct, but the resultant merged IO does not have have to be naturally > aligned. > > > > > Is my understanding right? > > Yes, but... > > > If yes, I'd suggest to document the point, > > and the last two checks could be change to: > > > > /* naturally aligned */ > > if (pos % iov_iter_count(iter)) > > return false; > > > > if (iov_iter_count(iter) > atomic_write_max_bytes) > > return false; > > .. we would not be merging at this point as this is just IO submission to > the block layer, so atomic_write_max_bytes does not come into play yet. If > you check patch 7/21, you will see that we limit IO size to > atomic_write_max_bytes, which is relevant merging. I know the motivation of atomic_write_max_bytes, and now I am wondering atomic_write_max_bytes may be exported to userspace for the sake of atomic write performance. Thanks, Ming
>> >> I added this here (as opposed to the caller), as I was not really worried >> about speeding up the failure path. Are you saying to call even earlier in >> submission path? > atomic_write_unit_min is one hardware property, and it should be checked > in blk_queue_atomic_write_unit_min_sectors() from beginning, then you > can avoid this check every other where. ok, but we still need to ensure in the submission path that the block device actually supports atomic writes - this was the initial check. > >>>> + if (pos % atomic_write_unit_min_bytes) >>>> + return false; >>>> + if (iov_iter_count(iter) % atomic_write_unit_min_bytes) >>>> + return false; >>>> + if (!is_power_of_2(iov_iter_count(iter))) >>>> + return false; >>>> + if (iov_iter_count(iter) > atomic_write_unit_max_bytes) >>>> + return false; >>>> + if (pos % iov_iter_count(iter)) >>>> + return false; >>> I am a bit confused about relation between atomic_write_unit_max_bytes and >>> atomic_write_max_bytes. >> I think that naming could be improved. Or even just drop merging (and >> atomic_write_max_bytes concept) until we show it to improve performance. >> >> So generally atomic_write_unit_max_bytes will be same as >> atomic_write_max_bytes, however it could be different if: >> a. request queue nr hw segments or other request queue limits needs to >> restrict atomic_write_unit_max_bytes >> b. atomic_write_unit_max_bytes does not need to be a power-of-2 and >> atomic_write_max_bytes does. So essentially: >> atomic_write_unit_max_bytes = rounddown_pow_of_2(atomic_write_max_bytes) >> > plug merge often improves sequential IO perf, so if the hardware supports > this way, I think 'atomic_write_max_bytes' should be supported from the > beginning, such as: > > - user space submits sequential N * (4k, 8k, 16k, ...) atomic writes, all can > be merged to single IO request, which is issued to driver. > > Or > > - user space submits sequential 4k, 4k, 8k, 16K, 32k, 64k atomic writes, all can > be merged to single IO request, which is issued to driver. Right, we do expect userspace to use a fixed block size, but we give scope in the API to use variable size. > > The hardware should recognize unit size by start LBA, and check if length is > valid, so probably the interface might be relaxed to: > > 1) start lba is unit aligned, and this unit is in the supported unit > range(power_2 in [unit_min, unit_max]) > > 2) length needs to be: > > - N * this_unit_size > - <= atomic_write_max_bytes Please note that we also need to consider: - any atomic write boundary (from NVMe) - virt boundary (from NVMe) And, as I mentioned elsewhere, I am still not 100% comfortable that we don't pay attention to regular max_sectors_kb... > > >>> Here the max IO length is limited to be <= atomic_write_unit_max_bytes, >>> so looks userspace can only submit IO with write-atomic-unit naturally >>> aligned IO(such as, 4k, 8k, 16k, 32k, ...), >> correct >> >>> but these user IOs are >>> allowed to be merged to big one if naturally alignment is respected and >>> the merged IO size is <= atomic_write_max_bytes. >> correct, but the resultant merged IO does not have have to be naturally >> aligned. >> >>> Is my understanding right? >> Yes, but... >> >>> If yes, I'd suggest to document the point, >>> and the last two checks could be change to: >>> >>> /* naturally aligned */ >>> if (pos % iov_iter_count(iter)) >>> return false; >>> >>> if (iov_iter_count(iter) > atomic_write_max_bytes) >>> return false; >> .. we would not be merging at this point as this is just IO submission to >> the block layer, so atomic_write_max_bytes does not come into play yet. If >> you check patch 7/21, you will see that we limit IO size to >> atomic_write_max_bytes, which is relevant merging. > I know the motivation of atomic_write_max_bytes, and now I am wondering > atomic_write_max_bytes may be exported to userspace for the sake of > atomic write performance. It is available from sysfs for the request queue, but in an earlier series Dave Chinner suggested doing more to expose to the application programmer. So here that would mean a statx member. I'm still not sure... it just didn't seem like a detail which the user would need to know or be able to do much with. Thanks, John
On Mon, Dec 04, 2023 at 01:13:55PM +0000, John Garry wrote: > > > > > > > I added this here (as opposed to the caller), as I was not really worried > > > about speeding up the failure path. Are you saying to call even earlier in > > > submission path? > > atomic_write_unit_min is one hardware property, and it should be checked > > in blk_queue_atomic_write_unit_min_sectors() from beginning, then you > > can avoid this check every other where. > > ok, but we still need to ensure in the submission path that the block device > actually supports atomic writes - this was the initial check. Then you may add one helper bdev_support_atomic_write(). > > > > > > > > + if (pos % atomic_write_unit_min_bytes) > > > > > + return false; > > > > > + if (iov_iter_count(iter) % atomic_write_unit_min_bytes) > > > > > + return false; > > > > > + if (!is_power_of_2(iov_iter_count(iter))) > > > > > + return false; > > > > > + if (iov_iter_count(iter) > atomic_write_unit_max_bytes) > > > > > + return false; > > > > > + if (pos % iov_iter_count(iter)) > > > > > + return false; > > > > I am a bit confused about relation between atomic_write_unit_max_bytes and > > > > atomic_write_max_bytes. > > > I think that naming could be improved. Or even just drop merging (and > > > atomic_write_max_bytes concept) until we show it to improve performance. > > > > > > So generally atomic_write_unit_max_bytes will be same as > > > atomic_write_max_bytes, however it could be different if: > > > a. request queue nr hw segments or other request queue limits needs to > > > restrict atomic_write_unit_max_bytes > > > b. atomic_write_unit_max_bytes does not need to be a power-of-2 and > > > atomic_write_max_bytes does. So essentially: > > > atomic_write_unit_max_bytes = rounddown_pow_of_2(atomic_write_max_bytes) > > > > > plug merge often improves sequential IO perf, so if the hardware supports > > this way, I think 'atomic_write_max_bytes' should be supported from the > > beginning, such as: > > > > - user space submits sequential N * (4k, 8k, 16k, ...) atomic writes, all can > > be merged to single IO request, which is issued to driver. > > > > Or > > > > - user space submits sequential 4k, 4k, 8k, 16K, 32k, 64k atomic writes, all can > > be merged to single IO request, which is issued to driver. > > Right, we do expect userspace to use a fixed block size, but we give scope > in the API to use variable size. Maybe it is enough to just take atomic_write_unit_min_bytes only, and allow length to be N * atomic_write_unit_min_bytes. But it may violate atomic write boundary? > > > > > The hardware should recognize unit size by start LBA, and check if length is > > valid, so probably the interface might be relaxed to: > > > > 1) start lba is unit aligned, and this unit is in the supported unit > > range(power_2 in [unit_min, unit_max]) > > > > 2) length needs to be: > > > > - N * this_unit_size > > - <= atomic_write_max_bytes > > Please note that we also need to consider: > - any atomic write boundary (from NVMe) Can you provide actual NVMe boundary value? Firstly natural aligned write won't cross boundary, so boundary should be >= write_unit_max, see blow code from patch 10/21: +static bool bio_straddles_atomic_write_boundary(loff_t bi_sector, + unsigned int bi_size, + unsigned int boundary) +{ + loff_t start = bi_sector << SECTOR_SHIFT; + loff_t end = start + bi_size; + loff_t start_mod = start % boundary; + loff_t end_mod = end % boundary; + + if (end - start > boundary) + return true; + if ((start_mod > end_mod) && (start_mod && end_mod)) + return true; + + return false; +} + Then if the WRITE size is <= boundary, the above function should return false, right? Looks like it is power_of(2) & aligned atomic_write_max_bytes? > - virt boundary (from NVMe) virt boundary is applied on bv_offset and bv_len, and NVMe's virt bounary is (4k - 1), it shouldn't be one issue in reality. > > And, as I mentioned elsewhere, I am still not 100% comfortable that we don't > pay attention to regular max_sectors_kb... max_sectors_kb should be bigger than atomic_write_max_bytes actually, then what is your concern? Thanks, Ming
On 05/12/2023 01:45, Ming Lei wrote: >> Right, we do expect userspace to use a fixed block size, but we give scope >> in the API to use variable size. > Maybe it is enough to just take atomic_write_unit_min_bytes > only, and allow length to be N * atomic_write_unit_min_bytes. > > But it may violate atomic write boundary? About atomic boundary, we just don't allow a merge which will result in a write which will straddle a boundary as there are no guarantees of atomicity then. Having said this, atomic write boundary is just relevant to NVMe, so if we don't have merges there, then we could just omit this code. > >>> The hardware should recognize unit size by start LBA, and check if length is >>> valid, so probably the interface might be relaxed to: >>> >>> 1) start lba is unit aligned, and this unit is in the supported unit >>> range(power_2 in [unit_min, unit_max]) >>> >>> 2) length needs to be: >>> >>> - N * this_unit_size >>> - <= atomic_write_max_bytes >> Please note that we also need to consider: >> - any atomic write boundary (from NVMe) > Can you provide actual NVMe boundary value? > > Firstly natural aligned write won't cross boundary, so boundary should > be >= write_unit_max, Correct > see blow code from patch 10/21: > > +static bool bio_straddles_atomic_write_boundary(loff_t bi_sector, > + unsigned int bi_size, > + unsigned int boundary) > +{ > + loff_t start = bi_sector << SECTOR_SHIFT; > + loff_t end = start + bi_size; > + loff_t start_mod = start % boundary; > + loff_t end_mod = end % boundary; > + > + if (end - start > boundary) > + return true; > + if ((start_mod > end_mod) && (start_mod && end_mod)) > + return true; > + > + return false; > +} > + > > Then if the WRITE size is <= boundary, the above function should return > false, right? Actually if WRITE > boundary then we must be crossing a boundary and should return true, which is what the first condition checks. However 2x naturally-aligned atomic writes could be less than atomic_write_max_bytes but still straddle if merged. > Looks like it is power_of(2) & aligned atomic_write_max_bytes? > >> - virt boundary (from NVMe) > virt boundary is applied on bv_offset and bv_len, and NVMe's virt > bounary is (4k - 1), it shouldn't be one issue in reality. On a related topic, as I understand, for NVMe we just split bios according to virt boundary for PRP, but we won't always use PRP. So is there value in not splitting bio according to PRP if SGL will actually be used? > >> And, as I mentioned elsewhere, I am still not 100% comfortable that we don't >> pay attention to regular max_sectors_kb... > max_sectors_kb should be bigger than atomic_write_max_bytes actually, > then what is your concern? My concern is that we don't enforce that, so we may issue atomic writes which exceed max_sectors_kb. If we did enforce it, then atomic_write_unit_min_bytes, atomic_write_unit_max_bytes, and atomic_write_max_bytes all need to be limited according to max_sectors_kb. Thanks, John
diff --git a/block/fops.c b/block/fops.c index acff3d5d22d4..516669ad69e5 100644 --- a/block/fops.c +++ b/block/fops.c @@ -41,6 +41,29 @@ static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos, !bdev_iter_is_aligned(bdev, iter); } +static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos, + struct iov_iter *iter) +{ + unsigned int atomic_write_unit_min_bytes = + queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev)); + unsigned int atomic_write_unit_max_bytes = + queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev)); + + if (!atomic_write_unit_min_bytes) + return false; + if (pos % atomic_write_unit_min_bytes) + return false; + if (iov_iter_count(iter) % atomic_write_unit_min_bytes) + return false; + if (!is_power_of_2(iov_iter_count(iter))) + return false; + if (iov_iter_count(iter) > atomic_write_unit_max_bytes) + return false; + if (pos % iov_iter_count(iter)) + return false; + return true; +} + #define DIO_INLINE_BIO_VECS 4 static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb, @@ -48,6 +71,8 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb, { struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host); struct bio_vec inline_vecs[DIO_INLINE_BIO_VECS], *vecs; + bool is_read = iov_iter_rw(iter) == READ; + bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read; loff_t pos = iocb->ki_pos; bool should_dirty = false; struct bio bio; @@ -56,6 +81,9 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb, if (blkdev_dio_unaligned(bdev, pos, iter)) return -EINVAL; + if (atomic_write && !blkdev_atomic_write_valid(bdev, pos, iter)) + return -EINVAL; + if (nr_pages <= DIO_INLINE_BIO_VECS) vecs = inline_vecs; else { @@ -65,7 +93,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb, return -ENOMEM; } - if (iov_iter_rw(iter) == READ) { + if (is_read) { bio_init(&bio, bdev, vecs, nr_pages, REQ_OP_READ); if (user_backed_iter(iter)) should_dirty = true; @@ -74,6 +102,8 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb, } bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT; bio.bi_ioprio = iocb->ki_ioprio; + if (atomic_write) + bio.bi_opf |= REQ_ATOMIC; ret = bio_iov_iter_get_pages(&bio, iter); if (unlikely(ret)) @@ -167,10 +197,14 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, struct blkdev_dio *dio; struct bio *bio; bool is_read = (iov_iter_rw(iter) == READ), is_sync; + bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read; blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb); loff_t pos = iocb->ki_pos; int ret = 0; + if (atomic_write) + return -EINVAL; + if (blkdev_dio_unaligned(bdev, pos, iter)) return -EINVAL; @@ -305,6 +339,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb, struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host); bool is_read = iov_iter_rw(iter) == READ; blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb); + bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read; struct blkdev_dio *dio; struct bio *bio; loff_t pos = iocb->ki_pos; @@ -313,6 +348,9 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb, if (blkdev_dio_unaligned(bdev, pos, iter)) return -EINVAL; + if (atomic_write && !blkdev_atomic_write_valid(bdev, pos, iter)) + return -EINVAL; + if (iocb->ki_flags & IOCB_ALLOC_CACHE) opf |= REQ_ALLOC_CACHE; bio = bio_alloc_bioset(bdev, nr_pages, opf, GFP_KERNEL, @@ -347,6 +385,8 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb, bio_set_pages_dirty(bio); } } else { + if (atomic_write) + bio->bi_opf |= REQ_ATOMIC; task_io_account_write(bio->bi_iter.bi_size); }
Add support for atomic writes, as follows: - Ensure that the IO follows all the atomic writes rules, like must be naturally aligned - Set REQ_ATOMIC Signed-off-by: John Garry <john.g.garry@oracle.com> --- block/fops.c | 42 +++++++++++++++++++++++++++++++++++++++++- 1 file changed, 41 insertions(+), 1 deletion(-)