diff mbox series

[v2,3/3] zonefs: fix synchronous write to sequential zone files

Message ID 20210417023323.852530-4-damien.lemoal@wdc.com (mailing list archive)
State New, archived
Headers show
Series Fix dm-crypt zoned block device support | expand

Commit Message

Damien Le Moal April 17, 2021, 2:33 a.m. UTC
Synchronous writes to sequential zone files cannot use zone append
operations if the underlying zoned device queue limit
max_zone_append_sectors is 0, indicating that the device does not
support this operation. In this case, fall back to using regular write
operations.

Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/zonefs/super.c  | 16 ++++++++++++----
 fs/zonefs/zonefs.h |  2 ++
 2 files changed, 14 insertions(+), 4 deletions(-)

Comments

Christoph Hellwig April 19, 2021, 6:45 a.m. UTC | #1
On Sat, Apr 17, 2021 at 11:33:23AM +0900, Damien Le Moal wrote:
> Synchronous writes to sequential zone files cannot use zone append
> operations if the underlying zoned device queue limit
> max_zone_append_sectors is 0, indicating that the device does not
> support this operation. In this case, fall back to using regular write
> operations.

Zone append is a mandatory feature of the zoned device API.
Damien Le Moal April 19, 2021, 7:08 a.m. UTC | #2
On 2021/04/19 15:45, Christoph Hellwig wrote:
> On Sat, Apr 17, 2021 at 11:33:23AM +0900, Damien Le Moal wrote:
>> Synchronous writes to sequential zone files cannot use zone append
>> operations if the underlying zoned device queue limit
>> max_zone_append_sectors is 0, indicating that the device does not
>> support this operation. In this case, fall back to using regular write
>> operations.
> 
> Zone append is a mandatory feature of the zoned device API.

Yes, I am well aware of that. All physical zoned devices and null blk do support
zone append, but the logical device created by dm-crypt is out. And we cannot
simply disable zone support in dm-crypt as there are use cases out there in the
field that I am aware of, in SMR space.

So this series is a compromise: preserve dm-crypt zone support for SMR (no one
uses the zone append emulation yet, as far as I know) by disabling zone append.

For zonefs, we can:
1) refuse to mount if ZA is disabled, same as btrfs
2) Do as I did in the patch, fallback to regular writes since that is easy to do
(zonefs file size tracks the WP position already).

I chose option (2) to allow for SMR+dm-crypt to still work with zonefs.
Christoph Hellwig April 19, 2021, 7:10 a.m. UTC | #3
On Mon, Apr 19, 2021 at 07:08:46AM +0000, Damien Le Moal wrote:
> 1) refuse to mount if ZA is disabled, same as btrfs

Yes, please do that.
Douglas Gilbert April 20, 2021, 1:20 a.m. UTC | #4
On 2021-04-19 2:45 a.m., Christoph Hellwig wrote:
> On Sat, Apr 17, 2021 at 11:33:23AM +0900, Damien Le Moal wrote:
>> Synchronous writes to sequential zone files cannot use zone append
>> operations if the underlying zoned device queue limit
>> max_zone_append_sectors is 0, indicating that the device does not
>> support this operation. In this case, fall back to using regular write
>> operations.
> 
> Zone append is a mandatory feature of the zoned device API.

So a hack required for ZNS and not needed by ZBC and ZAC becomes
a "mandatory feature" in a Linux API. Like many hacks, that one might
come back to bite you :-)

Doug Gilbert
Damien Le Moal April 20, 2021, 1:35 a.m. UTC | #5
On 2021/04/20 10:20, Douglas Gilbert wrote:
> On 2021-04-19 2:45 a.m., Christoph Hellwig wrote:
>> On Sat, Apr 17, 2021 at 11:33:23AM +0900, Damien Le Moal wrote:
>>> Synchronous writes to sequential zone files cannot use zone append
>>> operations if the underlying zoned device queue limit
>>> max_zone_append_sectors is 0, indicating that the device does not
>>> support this operation. In this case, fall back to using regular write
>>> operations.
>>
>> Zone append is a mandatory feature of the zoned device API.
> 
> So a hack required for ZNS and not needed by ZBC and ZAC becomes
> a "mandatory feature" in a Linux API. Like many hacks, that one might
> come back to bite you :-)

Zone append is not a hack in ZNS. It is a write interface that fits very well
with the multi-queue nature of NVMe. The "hack" is the emulation in scsi.

We decided on having this mandatory for zoned devices (all types) to make sure
that file systems do not have to implement different IO paths for sequential
writing to zones. Zone append does simplify a lot of things and allows to get
the best performance from ZNS drives. Zone write locking/serialization of writes
per zones using regular writes is much harder to implement, make a mess of the
file system code, and would kill write performance on ZNS.
diff mbox series

Patch

diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 049e36c69ed7..b97566b9dff7 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -689,14 +689,15 @@  static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-	struct block_device *bdev = inode->i_sb->s_bdev;
-	unsigned int max;
+	struct super_block *sb = inode->i_sb;
+	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
+	struct block_device *bdev = sb->s_bdev;
+	sector_t max = sbi->s_max_zone_append_sectors;
 	struct bio *bio;
 	ssize_t size;
 	int nr_pages;
 	ssize_t ret;
 
-	max = queue_max_zone_append_sectors(bdev_get_queue(bdev));
 	max = ALIGN_DOWN(max << SECTOR_SHIFT, inode->i_sb->s_blocksize);
 	iov_iter_truncate(from, max);
 
@@ -853,6 +854,8 @@  static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 
 	/* Enforce sequential writes (append only) in sequential zones */
 	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ) {
+		struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
+
 		mutex_lock(&zi->i_truncate_mutex);
 		if (iocb->ki_pos != zi->i_wpoffset) {
 			mutex_unlock(&zi->i_truncate_mutex);
@@ -860,7 +863,7 @@  static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 			goto inode_unlock;
 		}
 		mutex_unlock(&zi->i_truncate_mutex);
-		append = sync;
+		append = sync && sbi->s_max_zone_append_sectors;
 	}
 
 	if (append)
@@ -1683,6 +1686,11 @@  static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
 		sbi->s_mount_opts &= ~ZONEFS_MNTOPT_EXPLICIT_OPEN;
 	}
 
+	sbi->s_max_zone_append_sectors =
+		queue_max_zone_append_sectors(bdev_get_queue(sb->s_bdev));
+	if (!sbi->s_max_zone_append_sectors)
+		zonefs_info(sb, "Zone append is not supported: falling back to using regular writes\n");
+
 	ret = zonefs_read_super(sb);
 	if (ret)
 		return ret;
diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
index 51141907097c..2b8c3b1a32ea 100644
--- a/fs/zonefs/zonefs.h
+++ b/fs/zonefs/zonefs.h
@@ -185,6 +185,8 @@  struct zonefs_sb_info {
 
 	unsigned int		s_max_open_zones;
 	atomic_t		s_open_zones;
+
+	sector_t		s_max_zone_append_sectors;
 };
 
 static inline struct zonefs_sb_info *ZONEFS_SB(struct super_block *sb)