[v8,1/2] fs: New zonefs file system

Message ID 20200121065846.216538-2-damien.lemoal@wdc.com
State Superseded
Series
  • New zonefs file system

Commit Message

Damien Le Moal Jan. 21, 2020, 6:58 a.m. UTC
zonefs is a very simple file system exposing each zone of a zoned block
device as a file. Unlike a regular file system with zoned block device
support (e.g. f2fs), zonefs does not hide the sequential write
constraint of zoned block devices from the user. Files representing
sequential write zones of the device must be written sequentially
starting from the end of the file (append only writes).

As such, zonefs is in essence closer to a raw block device access
interface than to a full featured POSIX file system. The goal of zonefs
is to simplify the implementation of zoned block device support in
applications by replacing raw block device file accesses with a richer
file API, avoiding relying on direct block device file ioctls which may
be more obscure to developers. One example of this approach is the
implementation of LSM (log-structured merge) tree structures (such as
used in RocksDB and LevelDB) on zoned block devices by allowing SSTables
to be stored in a zone file similarly to a regular file system rather
than as a range of sectors of a zoned device. The introduction of the
higher level construct "one file is one zone" can help reduce the
amount of changes needed in the application and make it easier to
support different application programming languages.

Zonefs on-disk metadata is reduced to an immutable super block to
persistently store a magic number and optional feature flags and
values. On mount, zonefs uses blkdev_report_zones() to obtain the device
zone configuration and populates the mount point with a static file tree
solely based on this information. E.g. file sizes come from the device
zone type and write pointer offset managed by the device itself.
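
As a rough illustration of the last point, the sketch below models in user
space how a file size could be derived from a reported zone. The names
zone_kind, zone_info and zone_file_size are hypothetical and are not part
of the patch; the only assumption taken from the block layer is the 512 B
sector unit.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9 /* 512 B sectors */

/* Hypothetical, simplified view of one reported zone (units: sectors). */
enum zone_kind { ZONE_CONVENTIONAL, ZONE_SEQUENTIAL };

struct zone_info {
	enum zone_kind kind;
	uint64_t start; /* first sector of the zone */
	uint64_t len;   /* zone length in sectors */
	uint64_t wp;    /* write pointer position (absolute sector) */
};

/*
 * Conventional zone files always have the full zone size. Sequential zone
 * files are as large as the data written so far, that is, the write
 * pointer offset from the zone start.
 */
uint64_t zone_file_size(const struct zone_info *z)
{
	if (z->kind == ZONE_CONVENTIONAL)
		return z->len << SECTOR_SHIFT;
	return (z->wp - z->start) << SECTOR_SHIFT;
}
```

For a 256 MB sequential zone whose write pointer sits 8 sectors past the
zone start, this model yields a 4096 B file.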

The zone files created on mount have the following characteristics.
1) Files representing zones of the same type are grouped together
   under a common sub-directory:
     * For conventional zones, the sub-directory "cnv" is used.
     * For sequential write zones, the sub-directory "seq" is used.
  These two directories are the only directories that exist in zonefs.
  Users cannot create other directories and cannot rename nor delete
  the "cnv" and "seq" sub-directories.
2) The name of zone files is the number of the file within the zone
   type sub-directory, in order of increasing zone start sector.
3) The size of conventional zone files is fixed to the device zone size.
   Conventional zone files cannot be truncated.
4) The size of sequential zone files represents the file's zone write
   pointer position relative to the zone start sector. Truncating these
   files is allowed only down to 0, in which case, the zone is reset to
   rewind the zone write pointer position to the start of the zone, or
   up to the zone size, in which case the file's zone is transitioned
   to the FULL state (finish zone operation).
5) Read and write operations beyond the file zone size are not
   allowed. Any access exceeding the zone size fails with the -EFBIG
   error.
6) Creating, deleting, renaming or modifying any attribute of files and
   sub-directories is not allowed.
7) There are no restrictions on the type of read and write operations
   that can be issued to conventional zone files. Buffered, direct and
   mmap read & write operations are accepted. For sequential zone files,
   there are no restrictions on read operations, but all write
   operations must be direct IO append writes. mmap write of sequential
   files is not allowed.
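
The write rules in items 5 and 7 can be sketched as a small user-space
model. This is only an illustration under stated assumptions: struct
zone_file and zone_file_check_dio_write are hypothetical names, the block
size is assumed to be a power of two, and the truncation of writes that
cross the zone end is left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model of a zone file; not a kernel structure. */
struct zone_file {
	bool sequential;     /* true for files under "seq" */
	uint64_t max_size;   /* zone size in bytes */
	uint64_t wp_offset;  /* current file size (write pointer offset) */
	uint32_t block_size; /* device physical sector size, power of two */
};

/*
 * Return 0 if a direct IO write of count bytes at offset pos would be
 * accepted under the rules listed above, or a negative errno value.
 */
int zone_file_check_dio_write(const struct zone_file *f,
			      uint64_t pos, uint64_t count)
{
	/* Writes starting at or beyond the zone size fail with EFBIG. */
	if (pos >= f->max_size)
		return -EFBIG;

	/* Position and length must be aligned to the block size. */
	if ((pos | count) & (f->block_size - 1))
		return -EINVAL;

	/* Sequential files only accept writes at the current end of file. */
	if (f->sequential && pos != f->wp_offset)
		return -EINVAL;

	return 0;
}
```

With a sequential file holding 4096 B of data, a 4096 B write at offset
4096 passes this check, while the same write at offset 0 is rejected.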

Several optional features of zonefs can be enabled at format time.
* Conventional zone aggregation: ranges of contiguous conventional
  zones can be aggregated into a single larger file instead of the
  default one file per zone.
* File ownership: The owner UID and GID of zone files are by default 0
  (root) but can be changed to any valid UID/GID.
* File access permissions: the default 640 access permissions can be
  changed.

The mkzonefs tool is used to format zoned block devices for use with
zonefs. This tool is available on GitHub at:

git@github.com:damien-lemoal/zonefs-tools.git.

zonefs-tools also includes a test suite which can be run against any
zoned block device, including a null_blk block device created in zoned
mode.

Example: the following formats a 15TB host-managed SMR HDD with 256 MB
zones with the conventional zones aggregation feature enabled.

$ sudo mkzonefs -o aggr_cnv /dev/sdX
$ sudo mount -t zonefs /dev/sdX /mnt
$ ls -l /mnt/
total 0
dr-xr-xr-x 2 root root     1 Nov 25 13:23 cnv
dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq

The size of each zone file sub-directory indicates the number of files
existing for that zone type. In this example, there is only one
conventional zone file (all conventional zones are aggregated under a
single file).

$ ls -l /mnt/cnv
total 137101312
-rw-r----- 1 root root 140391743488 Nov 25 13:23 0

This aggregated conventional zone file can be used as a regular file.

$ sudo mkfs.ext4 /mnt/cnv/0
$ sudo mount -o loop /mnt/cnv/0 /data

The "seq" sub-directory grouping files for sequential write zones has
in this example 55356 zones.

$ ls -lv /mnt/seq
total 14511243264
-rw-r----- 1 root root 0 Nov 25 13:23 0
-rw-r----- 1 root root 0 Nov 25 13:23 1
-rw-r----- 1 root root 0 Nov 25 13:23 2
...
-rw-r----- 1 root root 0 Nov 25 13:23 55354
-rw-r----- 1 root root 0 Nov 25 13:23 55355

For sequential write zone files, the file size changes as data is
appended at the end of the file, similarly to any regular file system.

$ dd if=/dev/zero of=/mnt/seq/0 bs=4K count=1 conv=notrunc oflag=direct
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000452219 s, 9.1 MB/s

$ ls -l /mnt/seq/0
-rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0

The written file can be truncated to the zone size, preventing any
further write operation.

$ truncate -s 268435456 /mnt/seq/0
$ ls -l /mnt/seq/0
-rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0

Truncation to 0 size allows freeing the file zone storage space and
restarting append-writes to the file.

$ truncate -s 0 /mnt/seq/0
$ ls -l /mnt/seq/0
-rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0
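
The two truncation cases shown above can be summarized with a small
user-space model. The enum and function names are hypothetical, and the
model only classifies a requested size; it does not issue any zone
operation.

```c
#include <stdint.h>

/* Hypothetical outcomes of truncating a sequential zone file. */
enum seq_truncate_action {
	SEQ_TRUNCATE_RESET,  /* size 0: zone reset, write pointer rewound */
	SEQ_TRUNCATE_FINISH, /* size == zone size: zone transitioned to FULL */
	SEQ_TRUNCATE_REJECT, /* any other size is not allowed */
};

/* Classify a truncate request against the file's zone size (in bytes). */
enum seq_truncate_action classify_seq_truncate(uint64_t new_size,
					       uint64_t zone_size)
{
	if (new_size == 0)
		return SEQ_TRUNCATE_RESET;
	if (new_size == zone_size)
		return SEQ_TRUNCATE_FINISH;
	return SEQ_TRUNCATE_REJECT;
}
```

For the 256 MB zones of this example, only sizes 0 and 268435456 are
accepted by this model; a request such as 4096 is rejected.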

Since files are statically mapped to zones on the disk, the number of
blocks of a file as reported by stat() and fstat() indicates the size
of the file zone.

$ stat /mnt/seq/0
  File: /mnt/seq/0
  Size: 0       Blocks: 524288     IO Block: 4096   regular empty file
Device: 870h/2160d      Inode: 50431       Links: 1
Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/  root)
Access: 2019-11-25 13:23:57.048971997 +0900
Modify: 2019-11-25 13:52:25.553805765 +0900
Change: 2019-11-25 13:52:25.553805765 +0900
 Birth: -

The number of blocks of the file ("Blocks") in units of 512B blocks
gives the maximum file size of 524288 * 512 B = 256 MB, corresponding
to the device zone size in this example. Of note is that the "IO block"
field always indicates the minimum IO size for writes and corresponds
to the device physical sector size.
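
The arithmetic above can be written as a one-line helper. This is a
minimal sketch: zone_file_max_size is a hypothetical name, and it assumes
the Linux convention that st_blocks is counted in 512 B units.

```c
#include <stdint.h>

/* On Linux, stat() reports st_blocks in 512 B units. */
#define STAT_BLOCK_SIZE 512ULL

/* Maximum size in bytes of a zone file, given its st_blocks value. */
uint64_t zone_file_max_size(uint64_t st_blocks)
{
	return st_blocks * STAT_BLOCK_SIZE;
}
```

Passing the "Blocks" value 524288 from the stat output gives 268435456
bytes, matching the size used for the truncate-to-zone-size example.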

This code contains contributions from:
* Johannes Thumshirn <jthumshirn@suse.de>,
* Darrick J. Wong <darrick.wong@oracle.com>,
* Christoph Hellwig <hch@lst.de>,
* Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> and
* Ting Yao <tingyao@hust.edu.cn>.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 MAINTAINERS                |    9 +
 fs/Kconfig                 |    1 +
 fs/Makefile                |    1 +
 fs/zonefs/Kconfig          |    9 +
 fs/zonefs/Makefile         |    4 +
 fs/zonefs/super.c          | 1178 ++++++++++++++++++++++++++++++++++++
 fs/zonefs/zonefs.h         |  175 ++++++
 include/uapi/linux/magic.h |    1 +
 8 files changed, 1378 insertions(+)
 create mode 100644 fs/zonefs/Kconfig
 create mode 100644 fs/zonefs/Makefile
 create mode 100644 fs/zonefs/super.c
 create mode 100644 fs/zonefs/zonefs.h

Comments

Darrick J. Wong Jan. 21, 2020, 4:44 p.m. UTC | #1
On Tue, Jan 21, 2020 at 03:58:45PM +0900, Damien Le Moal wrote:
> zonefs is a very simple file system exposing each zone of a zoned block
> device as a file. Unlike a regular file system with zoned block device

(Still) looks good to me, so with the below typo fix, you can apply:
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

I think it's time to scoop up whatever acks and rvbs you get for this v8
series and send v9 as a pull request for 5.6.

> +static int zonefs_file_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	/*
> +	 * Since conventional zones accept random writes, conventioanl zone

"conventional" at the end.

(or shorten the sentence to "Conventional zones accept random writes, so
their zone files can support shared writable mappings.")

--D

> +	 * files can support shared writeable mappings. For sequential zone
> +	 * files, only readonly mappings are possible since there are no guarantees
> +	 * for write ordering due to msync() and page cache writeback.
> +	 */
> +	if (ZONEFS_I(file_inode(file))->i_ztype == ZONEFS_ZTYPE_SEQ &&
> +	    (vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
> +		return -EINVAL;
> +
> +	file_accessed(file);
> +	vma->vm_ops = &zonefs_file_vm_ops;
> +
> +	return 0;
> +}
> +
> +static loff_t zonefs_file_llseek(struct file *file, loff_t offset, int whence)
> +{
> +	loff_t isize = i_size_read(file_inode(file));
> +
> +	/*
> +	 * Seeks are limited to below the zone size for conventional zones
> +	 * and below the zone write pointer for sequential zones. In both
> +	 * cases, this limit is the inode size.
> +	 */
> +	return generic_file_llseek_size(file, offset, whence, isize, isize);
> +}
> +
> +static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> +	loff_t max_pos;
> +	size_t count;
> +	ssize_t ret;
> +
> +	if (iocb->ki_pos >= zi->i_max_size)
> +		return 0;
> +
> +	if (iocb->ki_flags & IOCB_NOWAIT) {
> +		if (!inode_trylock_shared(inode))
> +			return -EAGAIN;
> +	} else {
> +		inode_lock_shared(inode);
> +	}
> +
> +	mutex_lock(&zi->i_truncate_mutex);
> +
> +	/*
> +	 * Limit read operations to written data.
> +	 */
> +	max_pos = i_size_read(inode);
> +	if (iocb->ki_pos >= max_pos) {
> +		mutex_unlock(&zi->i_truncate_mutex);
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	iov_iter_truncate(to, max_pos - iocb->ki_pos);
> +
> +	mutex_unlock(&zi->i_truncate_mutex);
> +
> +	count = iov_iter_count(to);
> +
> +	if (iocb->ki_flags & IOCB_DIRECT) {
> +		if ((iocb->ki_pos | count) & sbi->s_blocksize_mask) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +		file_accessed(iocb->ki_filp);
> +		ret = iomap_dio_rw(iocb, to, &zonefs_iomap_ops, NULL,
> +				   is_sync_kiocb(iocb));
> +	} else {
> +		ret = generic_file_read_iter(iocb, to);
> +	}
> +
> +out:
> +	inode_unlock_shared(inode);
> +
> +	return ret;
> +}
> +
> +static int zonefs_report_zones_err_cb(struct blk_zone *zone, unsigned int idx,
> +				      void *data)
> +{
> +	struct inode *inode = data;
> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> +	loff_t pos;
> +
> +	/*
> +	 * The condition of the zone may have changed. Check it and adjust the
> +	 * inode information as needed, similarly to zonefs_init_file_inode().
> +	 */
> +	if (zone->cond == BLK_ZONE_COND_OFFLINE) {
> +		inode->i_flags |= S_IMMUTABLE;
> +		inode->i_mode &= ~0777;
> +		zone->wp = zone->start;
> +	} else if (zone->cond == BLK_ZONE_COND_READONLY) {
> +		inode->i_flags |= S_IMMUTABLE;
> +		inode->i_mode &= ~0222;
> +	}
> +
> +	pos = (zone->wp - zone->start) << SECTOR_SHIFT;
> +	zi->i_wpoffset = pos;
> +	if (i_size_read(inode) != pos) {
> +		zonefs_update_stats(inode, pos);
> +		i_size_write(inode, pos);
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * When a write error occurs in a sequential zone, the zone write pointer
> + * position must be refreshed to correct the file size and zonefs inode
> + * write pointer offset.
> + */
> +static int zonefs_seq_file_write_failed(struct inode *inode, int error)
> +{
> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> +	struct super_block *sb = inode->i_sb;
> +	sector_t sector = zi->i_zsector;
> +	unsigned int nofs_flag;
> +	int ret;
> +
> +	zonefs_warn(sb, "Updating inode zone %llu info\n", sector);
> +
> +	/*
> +	 * blkdev_report_zones() uses GFP_KERNEL by default. Force execution as
> +	 * if GFP_NOFS was specified so that it will not end up recursing into
> +	 * the FS on memory allocation.
> +	 */
> +	nofs_flag = memalloc_nofs_save();
> +	ret = blkdev_report_zones(sb->s_bdev, sector, 1,
> +				  zonefs_report_zones_err_cb, inode);
> +	memalloc_nofs_restore(nofs_flag);
> +
> +	if (ret != 1) {
> +		if (!ret)
> +			ret = -EIO;
> +		zonefs_err(sb, "Get zone %llu report failed %d\n",
> +			   sector, ret);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static int zonefs_file_dio_write_end(struct kiocb *iocb, ssize_t size, int ret,
> +				     unsigned int flags)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> +
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * Conventional zone file size is fixed to the zone size so there
> +	 * is no need to do anything.
> +	 */
> +	if (zi->i_ztype == ZONEFS_ZTYPE_CNV)
> +		return 0;
> +
> +	mutex_lock(&zi->i_truncate_mutex);
> +
> +	if (size < 0) {
> +		ret = zonefs_seq_file_write_failed(inode, size);
> +	} else if (i_size_read(inode) < iocb->ki_pos + size) {
> +		zonefs_update_stats(inode, iocb->ki_pos + size);
> +		i_size_write(inode, iocb->ki_pos + size);
> +	}
> +
> +	mutex_unlock(&zi->i_truncate_mutex);
> +
> +	return ret;
> +}
> +
> +static const struct iomap_dio_ops zonefs_dio_ops = {
> +	.end_io			= zonefs_file_dio_write_end,
> +};
> +
> +static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> +	size_t count;
> +	ssize_t ret;
> +
> +	if (iocb->ki_flags & IOCB_NOWAIT) {
> +		if (!inode_trylock(inode))
> +			return -EAGAIN;
> +	} else {
> +		inode_lock(inode);
> +	}
> +
> +	ret = generic_write_checks(iocb, from);
> +	if (ret <= 0)
> +		goto out;
> +
> +	iov_iter_truncate(from, zi->i_max_size - iocb->ki_pos);
> +	count = iov_iter_count(from);
> +
> +	/*
> +	 * Direct writes must be aligned to the block size, that is, the device
> +	 * physical sector size, to avoid errors when writing sequential zones
> +	 * on 512e devices (512B logical sector, 4KB physical sectors).
> +	 */
> +	if ((iocb->ki_pos | count) & sbi->s_blocksize_mask) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/*
> +	 * Enforce sequential writes (append only) in sequential zones.
> +	 */
> +	mutex_lock(&zi->i_truncate_mutex);
> +	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
> +	    iocb->ki_pos != zi->i_wpoffset) {
> +		zonefs_err(inode->i_sb,
> +			   "Unaligned write at %llu + %zu (wp %llu)\n",
> +			   iocb->ki_pos, count,
> +			   zi->i_wpoffset);
> +		mutex_unlock(&zi->i_truncate_mutex);
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	mutex_unlock(&zi->i_truncate_mutex);
> +
> +	ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops, &zonefs_dio_ops,
> +			   is_sync_kiocb(iocb));
> +	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
> +	    (ret > 0 || ret == -EIOCBQUEUED)) {
> +		if (ret > 0)
> +			count = ret;
> +		mutex_lock(&zi->i_truncate_mutex);
> +		zi->i_wpoffset += count;
> +		mutex_unlock(&zi->i_truncate_mutex);
> +	}
> +
> +out:
> +	inode_unlock(inode);
> +
> +	return ret;
> +}
> +
> +static ssize_t zonefs_file_buffered_write(struct kiocb *iocb,
> +					  struct iov_iter *from)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> +	ssize_t ret;
> +
> +	/*
> +	 * Direct IO writes are mandatory for sequential zones so that the
> +	 * write IO order is preserved.
> +	 */
> +	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ)
> +		return -EIO;
> +
> +	if (iocb->ki_flags & IOCB_NOWAIT) {
> +		if (!inode_trylock(inode))
> +			return -EAGAIN;
> +	} else {
> +		inode_lock(inode);
> +	}
> +
> +	ret = generic_write_checks(iocb, from);
> +	if (ret <= 0)
> +		goto out;
> +
> +	iov_iter_truncate(from, zi->i_max_size - iocb->ki_pos);
> +
> +	ret = iomap_file_buffered_write(iocb, from, &zonefs_iomap_ops);
> +	if (ret > 0)
> +		iocb->ki_pos += ret;
> +
> +out:
> +	inode_unlock(inode);
> +	if (ret > 0)
> +		ret = generic_write_sync(iocb, ret);
> +
> +	return ret;
> +}
> +
> +static ssize_t zonefs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	/*
> +	 * Check that the write operation does not go beyond the zone size.
> +	 */
> +	if (iocb->ki_pos >= ZONEFS_I(inode)->i_max_size)
> +		return -EFBIG;
> +
> +	if (iocb->ki_flags & IOCB_DIRECT)
> +		return zonefs_file_dio_write(iocb, from);
> +
> +	return zonefs_file_buffered_write(iocb, from);
> +}
> +
> +static const struct file_operations zonefs_file_operations = {
> +	.open		= generic_file_open,
> +	.fsync		= zonefs_file_fsync,
> +	.mmap		= zonefs_file_mmap,
> +	.llseek		= zonefs_file_llseek,
> +	.read_iter	= zonefs_file_read_iter,
> +	.write_iter	= zonefs_file_write_iter,
> +	.splice_read	= generic_file_splice_read,
> +	.splice_write	= iter_file_splice_write,
> +	.iopoll		= iomap_dio_iopoll,
> +};
> +
> +static struct kmem_cache *zonefs_inode_cachep;
> +
> +static struct inode *zonefs_alloc_inode(struct super_block *sb)
> +{
> +	struct zonefs_inode_info *zi;
> +
> +	zi = kmem_cache_alloc(zonefs_inode_cachep, GFP_KERNEL);
> +	if (!zi)
> +		return NULL;
> +
> +	inode_init_once(&zi->i_vnode);
> +	mutex_init(&zi->i_truncate_mutex);
> +	init_rwsem(&zi->i_mmap_sem);
> +
> +	return &zi->i_vnode;
> +}
> +
> +static void zonefs_free_inode(struct inode *inode)
> +{
> +	kmem_cache_free(zonefs_inode_cachep, ZONEFS_I(inode));
> +}
> +
> +/*
> + * File system stat.
> + */
> +static int zonefs_statfs(struct dentry *dentry, struct kstatfs *buf)
> +{
> +	struct super_block *sb = dentry->d_sb;
> +	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
> +	enum zonefs_ztype t;
> +	u64 fsid;
> +
> +	buf->f_type = ZONEFS_MAGIC;
> +	buf->f_bsize = sb->s_blocksize;
> +	buf->f_namelen = ZONEFS_NAME_MAX;
> +
> +	spin_lock(&sbi->s_lock);
> +
> +	buf->f_blocks = sbi->s_blocks;
> +	if (WARN_ON(sbi->s_used_blocks > sbi->s_blocks))
> +		buf->f_bfree = 0;
> +	else
> +		buf->f_bfree = buf->f_blocks - sbi->s_used_blocks;
> +	buf->f_bavail = buf->f_bfree;
> +
> +	for (t = 0; t < ZONEFS_ZTYPE_MAX; t++) {
> +		if (sbi->s_nr_files[t])
> +			buf->f_files += sbi->s_nr_files[t] + 1;
> +	}
> +	buf->f_ffree = 0;
> +
> +	spin_unlock(&sbi->s_lock);
> +
> +	fsid = le64_to_cpup((void *)sbi->s_uuid.b) ^
> +		le64_to_cpup((void *)sbi->s_uuid.b + sizeof(u64));
> +	buf->f_fsid.val[0] = (u32)fsid;
> +	buf->f_fsid.val[1] = (u32)(fsid >> 32);
> +
> +	return 0;
> +}
> +
> +static const struct super_operations zonefs_sops = {
> +	.alloc_inode	= zonefs_alloc_inode,
> +	.free_inode	= zonefs_free_inode,
> +	.statfs		= zonefs_statfs,
> +};
> +
> +static const struct inode_operations zonefs_dir_inode_operations = {
> +	.lookup		= simple_lookup,
> +	.setattr	= zonefs_inode_setattr,
> +};
> +
> +static void zonefs_init_dir_inode(struct inode *parent, struct inode *inode)
> +{
> +	inode_init_owner(inode, parent, S_IFDIR | 0555);
> +	inode->i_op = &zonefs_dir_inode_operations;
> +	inode->i_fop = &simple_dir_operations;
> +	set_nlink(inode, 2);
> +	inc_nlink(parent);
> +}
> +
> +static void zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone)
> +{
> +	struct super_block *sb = inode->i_sb;
> +	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> +	umode_t	perm = sbi->s_perm;
> +
> +	if (zone->cond == BLK_ZONE_COND_OFFLINE) {
> +		/*
> +		 * Dead zone: make the inode immutable, disable all accesses
> +		 * and set the file size to 0.
> +		 */
> +		inode->i_flags |= S_IMMUTABLE;
> +		zone->wp = zone->start;
> +		perm &= ~0777;
> +	} else if (zone->cond == BLK_ZONE_COND_READONLY) {
> +		/* Do not allow writes in read-only zones */
> +		inode->i_flags |= S_IMMUTABLE;
> +		perm &= ~0222;
> +	}
> +
> +	zi->i_ztype = zonefs_zone_type(zone);
> +	zi->i_zsector = zone->start;
> +	zi->i_max_size = min_t(loff_t, MAX_LFS_FILESIZE,
> +			       zone->len << SECTOR_SHIFT);
> +	if (zi->i_ztype == ZONEFS_ZTYPE_CNV)
> +		zi->i_wpoffset = zi->i_max_size;
> +	else
> +		zi->i_wpoffset = (zone->wp - zone->start) << SECTOR_SHIFT;
> +
> +	inode->i_mode = S_IFREG | perm;
> +	inode->i_uid = sbi->s_uid;
> +	inode->i_gid = sbi->s_gid;
> +	inode->i_size = zi->i_wpoffset;
> +	inode->i_blocks = zone->len;
> +
> +	inode->i_op = &zonefs_file_inode_operations;
> +	inode->i_fop = &zonefs_file_operations;
> +	inode->i_mapping->a_ops = &zonefs_file_aops;
> +
> +	sb->s_maxbytes = max(zi->i_max_size, sb->s_maxbytes);
> +	sbi->s_blocks += zi->i_max_size >> sb->s_blocksize_bits;
> +	sbi->s_used_blocks += zi->i_wpoffset >> sb->s_blocksize_bits;
> +}
> +
> +static struct dentry *zonefs_create_inode(struct dentry *parent,
> +					const char *name, struct blk_zone *zone)
> +{
> +	struct inode *dir = d_inode(parent);
> +	struct dentry *dentry;
> +	struct inode *inode;
> +
> +	dentry = d_alloc_name(parent, name);
> +	if (!dentry)
> +		return NULL;
> +
> +	inode = new_inode(parent->d_sb);
> +	if (!inode)
> +		goto out;
> +
> +	inode->i_ino = get_next_ino();
> +	inode->i_ctime = inode->i_mtime = inode->i_atime = dir->i_ctime;
> +	if (zone)
> +		zonefs_init_file_inode(inode, zone);
> +	else
> +		zonefs_init_dir_inode(dir, inode);
> +	d_add(dentry, inode);
> +	dir->i_size++;
> +
> +	return dentry;
> +
> +out:
> +	dput(dentry);
> +
> +	return NULL;
> +}
> +
> +static char *zgroups_name[ZONEFS_ZTYPE_MAX] = { "cnv", "seq" };
> +
> +struct zonefs_zone_data {
> +	struct super_block *sb;
> +	unsigned int nr_zones[ZONEFS_ZTYPE_MAX];
> +	struct blk_zone *zones;
> +};
> +
> +/*
> + * Create a zone group and populate it with zone files.
> + */
> +static int zonefs_create_zgroup(struct zonefs_zone_data *zd,
> +				enum zonefs_ztype type)
> +{
> +	struct super_block *sb = zd->sb;
> +	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
> +	struct blk_zone *zone, *next, *end;
> +	char name[ZONEFS_NAME_MAX];
> +	struct dentry *dir;
> +	unsigned int n = 0;
> +
> +	/* If the group is empty, there is nothing to do */
> +	if (!zd->nr_zones[type])
> +		return 0;
> +
> +	dir = zonefs_create_inode(sb->s_root, zgroups_name[type], NULL);
> +	if (!dir)
> +		return -ENOMEM;
> +
> +	/*
> +	 * The first zone contains the super block: skip it.
> +	 */
> +	end = zd->zones + blkdev_nr_zones(sb->s_bdev->bd_disk);
> +	for (zone = &zd->zones[1]; zone < end; zone = next) {
> +
> +		next = zone + 1;
> +		if (zonefs_zone_type(zone) != type)
> +			continue;
> +
> +		/*
> +		 * For conventional zones, contiguous zones can be aggregated
> +		 * together to form larger files.
> +		 * Note that this overwrites the length of the first zone of
> +		 * the set of contiguous zones aggregated together.
> +		 * Only zones with the same condition can be aggregated so that
> +		 * offline zones are excluded and readonly zones are aggregated
> +		 * together into a read only file.
> +		 */
> +		if (type == ZONEFS_ZTYPE_CNV &&
> +		    (sbi->s_features & ZONEFS_F_AGGRCNV)) {
> +			for (; next < end; next++) {
> +				if (zonefs_zone_type(next) != type ||
> +				    next->cond != zone->cond)
> +					break;
> +				zone->len += next->len;
> +			}
> +		}
> +
> +		/*
> +		 * Use the file number within its group as file name.
> +		 */
> +		snprintf(name, ZONEFS_NAME_MAX - 1, "%u", n);
> +		if (!zonefs_create_inode(dir, name, zone))
> +			return -ENOMEM;
> +
> +		n++;
> +	}
> +
> +	zonefs_info(sb, "Zone group \"%s\" has %u file%s\n",
> +		    zgroups_name[type], n, n > 1 ? "s" : "");
> +
> +	sbi->s_nr_files[type] = n;
> +
> +	return 0;
> +}
> +
> +static int zonefs_get_zone_info_cb(struct blk_zone *zone, unsigned int idx,
> +				   void *data)
> +{
> +	struct zonefs_zone_data *zd = data;
> +
> +	/*
> +	 * Count the number of usable zones: the first zone at index 0 contains
> +	 * the super block and is ignored.
> +	 */
> +	switch (zone->type) {
> +	case BLK_ZONE_TYPE_CONVENTIONAL:
> +		zone->wp = zone->start + zone->len;
> +		if (idx)
> +			zd->nr_zones[ZONEFS_ZTYPE_CNV]++;
> +		break;
> +	case BLK_ZONE_TYPE_SEQWRITE_REQ:
> +	case BLK_ZONE_TYPE_SEQWRITE_PREF:
> +		if (idx)
> +			zd->nr_zones[ZONEFS_ZTYPE_SEQ]++;
> +		break;
> +	default:
> +		zonefs_err(zd->sb, "Unsupported zone type 0x%x\n",
> +			   zone->type);
> +		return -EIO;
> +	}
> +
> +	memcpy(&zd->zones[idx], zone, sizeof(struct blk_zone));
> +
> +	return 0;
> +}
> +
> +static int zonefs_get_zone_info(struct zonefs_zone_data *zd)
> +{
> +	struct block_device *bdev = zd->sb->s_bdev;
> +	int ret;
> +
> +	zd->zones = kvcalloc(blkdev_nr_zones(bdev->bd_disk),
> +			     sizeof(struct blk_zone), GFP_KERNEL);
> +	if (!zd->zones)
> +		return -ENOMEM;
> +
> +	/* Get zones information */
> +	ret = blkdev_report_zones(bdev, 0, BLK_ALL_ZONES,
> +				  zonefs_get_zone_info_cb, zd);
> +	if (ret < 0) {
> +		zonefs_err(zd->sb, "Zone report failed %d\n", ret);
> +		return ret;
> +	}
> +
> +	if (ret != blkdev_nr_zones(bdev->bd_disk)) {
> +		zonefs_err(zd->sb, "Invalid zone report (%d/%u zones)\n",
> +			   ret, blkdev_nr_zones(bdev->bd_disk));
> +		return -EIO;
> +	}
> +
> +	return 0;
> +}
> +
> +static inline void zonefs_cleanup_zone_info(struct zonefs_zone_data *zd)
> +{
> +	kvfree(zd->zones);
> +}
> +
> +/*
> + * Read super block information from the device.
> + */
> +static int zonefs_read_super(struct super_block *sb)
> +{
> +	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
> +	struct zonefs_super *super;
> +	u32 crc, stored_crc;
> +	struct page *page;
> +	struct bio_vec bio_vec;
> +	struct bio bio;
> +	int ret;
> +
> +	page = alloc_page(GFP_KERNEL);
> +	if (!page)
> +		return -ENOMEM;
> +
> +	bio_init(&bio, &bio_vec, 1);
> +	bio.bi_iter.bi_sector = 0;
> +	bio_set_dev(&bio, sb->s_bdev);
> +	bio_set_op_attrs(&bio, REQ_OP_READ, 0);
> +	bio_add_page(&bio, page, PAGE_SIZE, 0);
> +
> +	ret = submit_bio_wait(&bio);
> +	if (ret)
> +		goto out;
> +
> +	super = page_address(page);
> +
> +	stored_crc = le32_to_cpu(super->s_crc);
> +	super->s_crc = 0;
> +	crc = crc32(~0U, (unsigned char *)super, sizeof(struct zonefs_super));
> +	if (crc != stored_crc) {
> +		zonefs_err(sb, "Invalid checksum (Expected 0x%08x, got 0x%08x)",
> +			   crc, stored_crc);
> +		ret = -EIO;
> +		goto out;
> +	}
> +
> +	ret = -EINVAL;
> +	if (le32_to_cpu(super->s_magic) != ZONEFS_MAGIC)
> +		goto out;
> +
> +	sbi->s_features = le64_to_cpu(super->s_features);
> +	if (sbi->s_features & ~ZONEFS_F_DEFINED_FEATURES) {
> +		zonefs_err(sb, "Unknown features set 0x%llx\n",
> +			   sbi->s_features);
> +		goto out;
> +	}
> +
> +	if (sbi->s_features & ZONEFS_F_UID) {
> +		sbi->s_uid = make_kuid(current_user_ns(),
> +				       le32_to_cpu(super->s_uid));
> +		if (!uid_valid(sbi->s_uid)) {
> +			zonefs_err(sb, "Invalid UID feature\n");
> +			goto out;
> +		}
> +	}
> +
> +	if (sbi->s_features & ZONEFS_F_GID) {
> +		sbi->s_gid = make_kgid(current_user_ns(),
> +				       le32_to_cpu(super->s_gid));
> +		if (!gid_valid(sbi->s_gid)) {
> +			zonefs_err(sb, "Invalid GID feature\n");
> +			goto out;
> +		}
> +	}
> +
> +	if (sbi->s_features & ZONEFS_F_PERM)
> +		sbi->s_perm = le32_to_cpu(super->s_perm);
> +
> +	if (memchr_inv(super->s_reserved, 0, sizeof(super->s_reserved))) {
> +		zonefs_err(sb, "Reserved area is being used\n");
> +		goto out;
> +	}
> +
> +	uuid_copy(&sbi->s_uuid, (uuid_t *)super->s_uuid);
> +	ret = 0;
> +
> +out:
> +	__free_page(page);
> +
> +	return ret;
> +}
> +
> +/*
> + * Check that the device is zoned. If it is, get the list of zones and create
> + * sub-directories and files according to the device zone configuration and
> + * format options.
> + */
> +static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
> +{
> +	struct zonefs_zone_data zd;
> +	struct zonefs_sb_info *sbi;
> +	struct inode *inode;
> +	enum zonefs_ztype t;
> +	int ret;
> +
> +	if (!bdev_is_zoned(sb->s_bdev)) {
> +		zonefs_err(sb, "Not a zoned block device\n");
> +		return -EINVAL;
> +	}
> +
> +	/*
> +	 * Initialize super block information: the maximum file size is updated
> +	 * when the zone files are created so that the format option
> +	 * ZONEFS_F_AGGRCNV which increases the maximum file size of a file
> +	 * beyond the zone size is taken into account.
> +	 */
> +	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
> +	if (!sbi)
> +		return -ENOMEM;
> +
> +	spin_lock_init(&sbi->s_lock);
> +	sb->s_fs_info = sbi;
> +	sb->s_magic = ZONEFS_MAGIC;
> +	sb->s_maxbytes = 0;
> +	sb->s_op = &zonefs_sops;
> +	sb->s_time_gran	= 1;
> +
> +	/*
> +	 * The block size is always equal to the device physical sector size to
> +	 * ensure that writes on 512e devices (512B logical block and 4KB
> +	 * physical block) are always aligned to the device physical blocks
> +	 * (as required for writes to sequential zones on ZBC/ZAC disks).
> +	 */
> +	sb_set_blocksize(sb, bdev_physical_block_size(sb->s_bdev));
> +	sbi->s_blocksize_mask = sb->s_blocksize - 1;
> +	sbi->s_uid = GLOBAL_ROOT_UID;
> +	sbi->s_gid = GLOBAL_ROOT_GID;
> +	sbi->s_perm = 0640;
> +
> +	ret = zonefs_read_super(sb);
> +	if (ret)
> +		return ret;
> +
> +	memset(&zd, 0, sizeof(struct zonefs_zone_data));
> +	zd.sb = sb;
> +	ret = zonefs_get_zone_info(&zd);
> +	if (ret)
> +		goto out;
> +
> +	zonefs_info(sb, "Mounting %u zones",
> +		    blkdev_nr_zones(sb->s_bdev->bd_disk));
> +
> +	/* Create root directory inode */
> +	ret = -ENOMEM;
> +	inode = new_inode(sb);
> +	if (!inode)
> +		goto out;
> +
> +	inode->i_ino = get_next_ino();
> +	inode->i_mode = S_IFDIR | 0555;
> +	inode->i_ctime = inode->i_mtime = inode->i_atime = current_time(inode);
> +	inode->i_op = &zonefs_dir_inode_operations;
> +	inode->i_fop = &simple_dir_operations;
> +	set_nlink(inode, 2);
> +
> +	sb->s_root = d_make_root(inode);
> +	if (!sb->s_root)
> +		goto out;
> +
> +	/* Create and populate files in zone groups directories */
> +	for (t = 0; t < ZONEFS_ZTYPE_MAX; t++) {
> +		ret = zonefs_create_zgroup(&zd, t);
> +		if (ret)
> +			break;
> +	}
> +
> +out:
> +	zonefs_cleanup_zone_info(&zd);
> +
> +	return ret;
> +}
> +
> +static struct dentry *zonefs_mount(struct file_system_type *fs_type,
> +				   int flags, const char *dev_name, void *data)
> +{
> +	return mount_bdev(fs_type, flags, dev_name, data, zonefs_fill_super);
> +}
> +
> +static void zonefs_kill_super(struct super_block *sb)
> +{
> +	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
> +
> +	kfree(sbi);
> +	if (sb->s_root)
> +		d_genocide(sb->s_root);
> +	kill_block_super(sb);
> +}
> +
> +/*
> + * File system definition and registration.
> + */
> +static struct file_system_type zonefs_type = {
> +	.owner		= THIS_MODULE,
> +	.name		= "zonefs",
> +	.mount		= zonefs_mount,
> +	.kill_sb	= zonefs_kill_super,
> +	.fs_flags	= FS_REQUIRES_DEV,
> +};
> +
> +static int __init zonefs_init_inodecache(void)
> +{
> +	zonefs_inode_cachep = kmem_cache_create("zonefs_inode_cache",
> +			sizeof(struct zonefs_inode_info), 0,
> +			(SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD | SLAB_ACCOUNT),
> +			NULL);
> +	if (zonefs_inode_cachep == NULL)
> +		return -ENOMEM;
> +	return 0;
> +}
> +
> +static void zonefs_destroy_inodecache(void)
> +{
> +	/*
> +	 * Make sure all delayed rcu free inodes are flushed before we
> +	 * destroy the inode cache.
> +	 */
> +	rcu_barrier();
> +	kmem_cache_destroy(zonefs_inode_cachep);
> +}
> +
> +static int __init zonefs_init(void)
> +{
> +	int ret;
> +
> +	BUILD_BUG_ON(sizeof(struct zonefs_super) != ZONEFS_SUPER_SIZE);
> +
> +	ret = zonefs_init_inodecache();
> +	if (ret)
> +		return ret;
> +
> +	ret = register_filesystem(&zonefs_type);
> +	if (ret) {
> +		zonefs_destroy_inodecache();
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static void __exit zonefs_exit(void)
> +{
> +	zonefs_destroy_inodecache();
> +	unregister_filesystem(&zonefs_type);
> +}
> +
> +MODULE_AUTHOR("Damien Le Moal");
> +MODULE_DESCRIPTION("Zone file system for zoned block devices");
> +MODULE_LICENSE("GPL");
> +module_init(zonefs_init);
> +module_exit(zonefs_exit);
> diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
> new file mode 100644
> index 000000000000..5862e17573ba
> --- /dev/null
> +++ b/fs/zonefs/zonefs.h
> @@ -0,0 +1,175 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Simple zone file system for zoned block devices.
> + *
> + * Copyright (C) 2019 Western Digital Corporation or its affiliates.
> + */
> +#ifndef __ZONEFS_H__
> +#define __ZONEFS_H__
> +
> +#include <linux/fs.h>
> +#include <linux/magic.h>
> +#include <linux/uuid.h>
> +#include <linux/mutex.h>
> +#include <linux/rwsem.h>
> +
> +/*
> + * Maximum length of file names: this only needs to be large enough to fit
> + * the zone group directory names and a decimal zone number for file names.
> + * 16 characters is plenty.
> + */
> +#define ZONEFS_NAME_MAX		16
> +
> +/*
> + * Zone types: ZONEFS_ZTYPE_SEQ is used for all sequential zone types
> + * defined in linux/blkzoned.h, that is, BLK_ZONE_TYPE_SEQWRITE_REQ and
> + * BLK_ZONE_TYPE_SEQWRITE_PREF.
> + */
> +enum zonefs_ztype {
> +	ZONEFS_ZTYPE_CNV,
> +	ZONEFS_ZTYPE_SEQ,
> +	ZONEFS_ZTYPE_MAX,
> +};
> +
> +static inline enum zonefs_ztype zonefs_zone_type(struct blk_zone *zone)
> +{
> +	if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
> +		return ZONEFS_ZTYPE_CNV;
> +	return ZONEFS_ZTYPE_SEQ;
> +}
> +
> +/*
> + * In-memory inode data.
> + */
> +struct zonefs_inode_info {
> +	struct inode		i_vnode;
> +
> +	/* File zone type */
> +	enum zonefs_ztype	i_ztype;
> +
> +	/* File zone start sector (512B unit) */
> +	sector_t		i_zsector;
> +
> +	/* File zone write pointer position (sequential zones only) */
> +	loff_t			i_wpoffset;
> +
> +	/* File maximum size */
> +	loff_t			i_max_size;
> +
> +	/*
> +	 * To serialise fully against both syscall and mmap based IO and
> +	 * sequential file truncation, two locks are used. For serializing
> +	 * zonefs_seq_file_truncate() against zonefs_iomap_begin(), that is,
> +	 * file truncate operations against block mapping, i_truncate_mutex is
> +	 * used. i_truncate_mutex also protects against concurrent accesses
> +	 * and changes to the inode private data, and in particular changes to
> +	 * a sequential file size on completion of direct IO writes.
> +	 * Serialization of mmap read IOs with truncate and syscall IO
> +	 * operations is done with i_mmap_sem in addition to i_truncate_mutex.
> +	 * Only zonefs_seq_file_truncate() takes both locks (i_mmap_sem first,
> +	 * i_truncate_mutex second).
> +	 */
> +	struct mutex		i_truncate_mutex;
> +	struct rw_semaphore	i_mmap_sem;
> +};
> +
> +static inline struct zonefs_inode_info *ZONEFS_I(struct inode *inode)
> +{
> +	return container_of(inode, struct zonefs_inode_info, i_vnode);
> +}
> +
> +/*
> + * On-disk super block (block 0).
> + */
> +#define ZONEFS_LABEL_LEN	64
> +#define ZONEFS_UUID_SIZE	16
> +#define ZONEFS_SUPER_SIZE	4096
> +
> +struct zonefs_super {
> +
> +	/* Magic number */
> +	__le32		s_magic;
> +
> +	/* Checksum */
> +	__le32		s_crc;
> +
> +	/* Volume label */
> +	char		s_label[ZONEFS_LABEL_LEN];
> +
> +	/* 128-bit uuid */
> +	__u8		s_uuid[ZONEFS_UUID_SIZE];
> +
> +	/* Features */
> +	__le64		s_features;
> +
> +	/* UID/GID to use for files */
> +	__le32		s_uid;
> +	__le32		s_gid;
> +
> +	/* File permissions */
> +	__le32		s_perm;
> +
> +	/* Padding to ZONEFS_SUPER_SIZE bytes */
> +	__u8		s_reserved[3988];
> +
> +} __packed;
> +
> +/*
> + * Feature flags: used on disk in the s_features field of struct zonefs_super
> + * and in-memory in the s_features field of struct zonefs_sb_info.
> + */
> +enum zonefs_features {
> +	/*
> +	 * Aggregate contiguous conventional zones into a single file.
> +	 */
> +	ZONEFS_F_AGGRCNV = 1ULL << 0,
> +	/*
> +	 * Use super block specified UID for files instead of default.
> +	 */
> +	ZONEFS_F_UID = 1ULL << 1,
> +	/*
> +	 * Use super block specified GID for files instead of default.
> +	 */
> +	ZONEFS_F_GID = 1ULL << 2,
> +	/*
> +	 * Use super block specified file permissions instead of default 640.
> +	 */
> +	ZONEFS_F_PERM = 1ULL << 3,
> +};
> +
> +#define ZONEFS_F_DEFINED_FEATURES \
> +	(ZONEFS_F_AGGRCNV | ZONEFS_F_UID | ZONEFS_F_GID | ZONEFS_F_PERM)
> +
> +/*
> + * In-memory Super block information.
> + */
> +struct zonefs_sb_info {
> +
> +	spinlock_t		s_lock;
> +
> +	unsigned long long	s_features;
> +	kuid_t			s_uid;
> +	kgid_t			s_gid;
> +	umode_t			s_perm;
> +	uuid_t			s_uuid;
> +	loff_t			s_blocksize_mask;
> +
> +	unsigned int		s_nr_files[ZONEFS_ZTYPE_MAX];
> +
> +	loff_t			s_blocks;
> +	loff_t			s_used_blocks;
> +};
> +
> +static inline struct zonefs_sb_info *ZONEFS_SB(struct super_block *sb)
> +{
> +	return sb->s_fs_info;
> +}
> +
> +#define zonefs_info(sb, format, args...)	\
> +	pr_info("zonefs (%s): " format, sb->s_id, ## args)
> +#define zonefs_err(sb, format, args...)	\
> +	pr_err("zonefs (%s) ERROR: " format, sb->s_id, ## args)
> +#define zonefs_warn(sb, format, args...)	\
> +	pr_warn("zonefs (%s) WARN: " format, sb->s_id, ## args)
> +
> +#endif
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 3ac436376d79..d78064007b17 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -87,6 +87,7 @@
>  #define NSFS_MAGIC		0x6e736673
>  #define BPF_FS_MAGIC		0xcafe4a11
>  #define AAFS_MAGIC		0x5a3c69f0
> +#define ZONEFS_MAGIC		0x5a4f4653
>  
>  /* Since UDF 2.01 is ISO 13346 based... */
>  #define UDF_SUPER_MAGIC		0x15013346
> -- 
> 2.24.1
>
Dave Chinner Jan. 22, 2020, 1:57 a.m. UTC | #2
Hi Damien,

I've finally had a chance to recover from LCA, catch up and look at
this again. Overall, pretty good, but a few comments below....

On Tue, Jan 21, 2020 at 03:58:45PM +0900, Damien Le Moal wrote:
> zonefs is a very simple file system exposing each zone of a zoned block
> device as a file. Unlike a regular file system with zoned block device
> support (e.g. f2fs), zonefs does not hide the sequential write
> constraint of zoned block devices to the user. Files representing
> sequential write zones of the device must be written sequentially
> starting from the end of the file (append only writes).

....

> --- /dev/null
> +++ b/fs/zonefs/super.c
> @@ -0,0 +1,1178 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Simple zone file system for zoned block devices.
> + *
> + * Copyright (C) 2019 Western Digital Corporation or its affiliates.
> + */
> +#include <linux/module.h>
> +#include <linux/fs.h>
> +#include <linux/magic.h>
> +#include <linux/iomap.h>
> +#include <linux/init.h>
> +#include <linux/slab.h>
> +#include <linux/blkdev.h>
> +#include <linux/statfs.h>
> +#include <linux/writeback.h>
> +#include <linux/quotaops.h>
> +#include <linux/seq_file.h>
> +#include <linux/parser.h>
> +#include <linux/uio.h>
> +#include <linux/mman.h>
> +#include <linux/sched/mm.h>
> +#include <linux/crc32.h>
> +
> +#include "zonefs.h"
> +
> +static int zonefs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> +			      unsigned int flags, struct iomap *iomap,
> +			      struct iomap *srcmap)
> +{
> +	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> +	loff_t max_isize = zi->i_max_size;
> +	loff_t isize;
> +
> +	/*
> +	 * For sequential zones, enforce direct IO writes. This is already
> +	 * checked when writes are issued, so warn about this here if we
> +	 * get buffered write to a sequential file inode.
> +	 */
> +	if (WARN_ON_ONCE(zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
> +			 (flags & IOMAP_WRITE) && !(flags & IOMAP_DIRECT)))
> +		return -EIO;
> +
> +	/*
> +	 * For all zones, all blocks are always mapped. For sequential zones,
> +	 * all blocks after the write pointer (inode size) are always unwritten.
> +	 */
> +	mutex_lock(&zi->i_truncate_mutex);
> +	isize = i_size_read(inode);
> +	if (offset >= isize) {
> +		length = min(length, max_isize - offset);
> +		if (zi->i_ztype == ZONEFS_ZTYPE_CNV)
> +			iomap->type = IOMAP_MAPPED;
> +		else
> +			iomap->type = IOMAP_UNWRITTEN;
> +	} else {
> +		length = min(length, isize - offset);
> +		iomap->type = IOMAP_MAPPED;
> +	}

Something was bugging me about this, and reading the rest of the
patch it finally triggered. For conventional zones, inode->i_size =
zi->i_max_size, and so if offset >= isize for a conventional
zone then this:

	length = min(length, max_isize - offset);

is going to result in length <= 0 and we return a negative length
iomap.

IOWs, this case should only trigger for IO into sequential zones,
as it appears to be prevented at higher layers for conventional
zones by explicit checks against i_max_size and/or
iov_iter_truncate() calls to ensure user IOs are limited to within
i_max_size.

Hence it looks to me that triggering the (offset >= isize) case here
for conventional zones is a WARN_ON_ONCE() and return -EIO
situation...

So, perhaps:

	isize = i_size_read(inode);
	if (offset >= isize) {
		if (WARN_ON_ONCE(zi->i_ztype == ZONEFS_ZTYPE_CNV)) {
			mutex_unlock(&zi->i_truncate_mutex);
			return -EIO;
		}
		length = min(length, max_isize - offset);
		iomap->type = IOMAP_UNWRITTEN;
	} else {
		length = min(length, isize - offset);
		iomap->type = IOMAP_MAPPED;
	}
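
To make the arithmetic concrete, here is a minimal user-space sketch of the clamp (not kernel code). The names clamp_len() and off_model are made up for illustration, and min() is open-coded:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's loff_t (signed 64-bit file offset). */
typedef int64_t off_model;

/*
 * Hypothetical model of the clamp in zonefs_iomap_begin():
 *	length = min(length, max_isize - offset);
 * Returns the clamped mapping length for a request starting at 'offset'.
 */
static off_model clamp_len(off_model offset, off_model length,
			   off_model max_isize)
{
	off_model remaining = max_isize - offset;

	/* The request itself is assumed to be sane. */
	assert(length >= 0 && max_isize >= 0);

	return length < remaining ? length : remaining;
}
```

With a 256 MiB conventional zone file (isize == max_isize == 268435456), clamp_len(268435456, 4096, 268435456) gives 0 and clamp_len(268439552, 4096, 268435456) gives -4096, which is the non-positive length described above.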

This also seems tailored around the call from zonefs_map_blocks()
which tries to map the entire zone (length = zi->i_max_size) for
writeback mappings. Hence the length in this case always requires
clamping to zi->i_max_size - offset. Again, there's an issue here:

> +static int zonefs_map_blocks(struct iomap_writepage_ctx *wpc,
> +			     struct inode *inode, loff_t offset)
> +{
> +	if (offset >= wpc->iomap.offset &&
> +	    offset < wpc->iomap.offset + wpc->iomap.length)
> +		return 0;
> +
> +	memset(&wpc->iomap, 0, sizeof(wpc->iomap));
> +	return zonefs_iomap_begin(inode, offset, ZONEFS_I(inode)->i_max_size,
> +				  0, &wpc->iomap, NULL);

Where we pass flags = 0 into zonefs_iomap_begin(), and so there is
no checking that this writeback code path is only executing against
a conventional zone. I.e. the comments and checks in
zonefs_iomap_begin() relate only to user IO call paths, but don't
validate or comment on the writeback path callers, and there's no
comments or checks here that the inode points at a conventional
zone, either....

> +static vm_fault_t zonefs_filemap_fault(struct vm_fault *vmf)
> +{
> +	struct zonefs_inode_info *zi = ZONEFS_I(file_inode(vmf->vma->vm_file));
> +	vm_fault_t ret;
> +
> +	down_read(&zi->i_mmap_sem);
> +	ret = filemap_fault(vmf);
> +	up_read(&zi->i_mmap_sem);
> +
> +	return ret;
> +}
> +
> +static vm_fault_t zonefs_filemap_page_mkwrite(struct vm_fault *vmf)
> +{
> +	struct inode *inode = file_inode(vmf->vma->vm_file);
> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> +	vm_fault_t ret;
> +
> +	sb_start_pagefault(inode->i_sb);
> +	file_update_time(vmf->vma->vm_file);
> +
> +	/* Serialize against truncates */
> +	down_read(&zi->i_mmap_sem);
> +	ret = iomap_page_mkwrite(vmf, &zonefs_iomap_ops);
> +	up_read(&zi->i_mmap_sem);
> +
> +	sb_end_pagefault(inode->i_sb);
> +	return ret;
> +}

Should there be a WARN_ON_ONCE(zi->i_ztype != ZONEFS_ZTYPE_CNV) in
here?

> +static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> +	loff_t max_pos;
> +	size_t count;
> +	ssize_t ret;
> +
> +	if (iocb->ki_pos >= zi->i_max_size)
> +		return 0;
> +
> +	if (iocb->ki_flags & IOCB_NOWAIT) {
> +		if (!inode_trylock_shared(inode))
> +			return -EAGAIN;
> +	} else {
> +		inode_lock_shared(inode);
> +	}

We should really turn that into a generic helper. This pattern is
being replicated all over the place. Not in this patchset, though...

> +static int zonefs_report_zones_err_cb(struct blk_zone *zone, unsigned int idx,
> +				      void *data)
> +{
> +	struct inode *inode = data;
> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> +	loff_t pos;
> +
> +	/*
> +	 * The condition of the zone may have changed. Check it and adjust the
> +	 * inode information as needed, similarly to zonefs_init_file_inode().
> +	 */
> +	if (zone->cond == BLK_ZONE_COND_OFFLINE) {
> +		inode->i_flags |= S_IMMUTABLE;
> +		inode->i_mode &= ~0777;
> +		zone->wp = zone->start;
> +	} else if (zone->cond == BLK_ZONE_COND_READONLY) {
> +		inode->i_flags |= S_IMMUTABLE;
> +		inode->i_mode &= ~0222;
> +	}

This exact code is repeated in zonefs_init_file_inode(). Maybe it
should be a helper function?
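
Something along these lines, shown here as a user-space model rather than kernel code. All model_* names and the MODEL_* constants are stand-ins I made up for illustration; they are not the kernel definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for kernel types and constants; the values are illustrative. */
enum model_zone_cond { MODEL_COND_OK, MODEL_COND_READONLY, MODEL_COND_OFFLINE };
#define MODEL_S_IMMUTABLE	(1u << 0)

struct model_zone {
	enum model_zone_cond cond;
	uint64_t start;	/* zone start sector */
	uint64_t wp;	/* write pointer sector */
};

struct model_inode {
	unsigned int i_flags;
	unsigned int i_mode;
};

/*
 * Hypothetical shared helper: apply the zone condition to the inode, so that
 * zonefs_init_file_inode() and zonefs_report_zones_err_cb() do not have to
 * open-code the same checks.
 */
static void model_check_zone_condition(struct model_inode *inode,
				       struct model_zone *zone)
{
	assert(inode && zone);

	switch (zone->cond) {
	case MODEL_COND_OFFLINE:
		/* Offline zone: no access at all, treat the file as empty. */
		inode->i_flags |= MODEL_S_IMMUTABLE;
		inode->i_mode &= ~0777u;
		zone->wp = zone->start;
		break;
	case MODEL_COND_READONLY:
		/* Read-only zone: drop all write permission bits. */
		inode->i_flags |= MODEL_S_IMMUTABLE;
		inode->i_mode &= ~0222u;
		break;
	default:
		break;
	}
}
```

Both call sites would then reduce to a single call before computing the file size from zone->wp.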

> +
> +	pos = (zone->wp - zone->start) << SECTOR_SHIFT;
> +	zi->i_wpoffset = pos;
> +	if (i_size_read(inode) != pos) {
> +		zonefs_update_stats(inode, pos);
> +		i_size_write(inode, pos);
> +	}

What happens if this decreases the file size? Don't we need
to invalidate the page cache beyond the new EOF in this case (i.e.
it's a truncate operation)?

> +static int zonefs_seq_file_write_failed(struct inode *inode, int error)
> +{
> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> +	struct super_block *sb = inode->i_sb;
> +	sector_t sector = zi->i_zsector;
> +	unsigned int nofs_flag;
> +	int ret;
> +
> +	zonefs_warn(sb, "Updating inode zone %llu info\n", sector);
> +
> +	/*
> +	 * blkdev_report_zones() uses GFP_KERNEL by default. Force execution as
> +	 * if GFP_NOFS was specified so that it will not end up recursing into
> +	 * the FS on memory allocation.
> +	 */
> +	nofs_flag = memalloc_nofs_save();
> +	ret = blkdev_report_zones(sb->s_bdev, sector, 1,
> +				  zonefs_report_zones_err_cb, inode);
> +	memalloc_nofs_restore(nofs_flag);

The comment is kinda redundant - it's explaining exactly what the
code does rather than why it needs this protection. i.e. the comment
should explain the recursion vector/deadlock that we are avoiding
here...

> +static int zonefs_file_dio_write_end(struct kiocb *iocb, ssize_t size, int ret,
> +				     unsigned int flags)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> +
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * Conventional zone file size is fixed to the zone size so there
> +	 * is no need to do anything.
> +	 */
> +	if (zi->i_ztype == ZONEFS_ZTYPE_CNV)
> +		return 0;
> +
> +	mutex_lock(&zi->i_truncate_mutex);
> +
> +	if (size < 0) {
> +		ret = zonefs_seq_file_write_failed(inode, size);

Ok, so I see it is being called from IO completion context, which
means we'd want memalloc_noio_save() because the underlying bio
doesn't get freed until this whole completion runs, right?

> +static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> +	size_t count;
> +	ssize_t ret;
> +
> +	if (iocb->ki_flags & IOCB_NOWAIT) {
> +		if (!inode_trylock(inode))
> +			return -EAGAIN;
> +	} else {
> +		inode_lock(inode);
> +	}
> +
> +	ret = generic_write_checks(iocb, from);
> +	if (ret <= 0)
> +		goto out;
> +
> +	iov_iter_truncate(from, zi->i_max_size - iocb->ki_pos);
> +	count = iov_iter_count(from);

So count is initialised to the entire IO length....

> +
> +	/*
> +	 * Direct writes must be aligned to the block size, that is, the device
> +	 * physical sector size, to avoid errors when writing sequential zones
> +	 * on 512e devices (512B logical sector, 4KB physical sectors).
> +	 */
> +	if ((iocb->ki_pos | count) & sbi->s_blocksize_mask) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/*
> +	 * Enforce sequential writes (append only) in sequential zones.
> +	 */
> +	mutex_lock(&zi->i_truncate_mutex);
> +	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
> +	    iocb->ki_pos != zi->i_wpoffset) {
> +		zonefs_err(inode->i_sb,
> +			   "Unaligned write at %llu + %zu (wp %llu)\n",
> +			   iocb->ki_pos, count,
> +			   zi->i_wpoffset);
> +		mutex_unlock(&zi->i_truncate_mutex);
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	mutex_unlock(&zi->i_truncate_mutex);
> +
> +	ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops, &zonefs_dio_ops,
> +			   is_sync_kiocb(iocb));
> +	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
> +	    (ret > 0 || ret == -EIOCBQUEUED)) {
> +		if (ret > 0)
> +			count = ret;
> +		mutex_lock(&zi->i_truncate_mutex);
> +		zi->i_wpoffset += count;
> +		mutex_unlock(&zi->i_truncate_mutex);

Hmmmm. This looks problematic w.r.t. AIO. If we get -EIOCBQUEUED it
means the IO has been queued but not necessarily submitted, but
we update zi->i_wpoffset as though the entire AIO has already
completed. And then we drop the inode_lock() and return, allowing
another AIO+DIO to be started.

Hence another concurrent sequential AIO+DIO write could now be
submitted and pass the above iocb->ki_pos != zi->i_wpoffset check.
Now we have two independent IOs in flight - one that is at the
current hardware write pointer offset, and one that is beyond it.

What happens if the block layer now re-orders these two IOs?
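
As a toy illustration of the concern (a user-space model, not kernel or device code), assume a sequential zone that rejects any write that does not start at its write pointer. The struct and function names below are invented for this sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of one sequential zone: writes must start at the write pointer. */
struct model_seq_zone {
	uint64_t wp;		/* device write pointer, in bytes from zone start */
	uint64_t capacity;	/* zone size in bytes */
};

/* Returns 0 on success, -1 if the modelled device would reject the write. */
static int model_zone_write(struct model_seq_zone *z, uint64_t pos,
			    uint64_t len)
{
	assert(z);

	if (pos != z->wp || len > z->capacity - z->wp)
		return -1;	/* unaligned or out-of-zone write */
	z->wp += len;
	return 0;
}
```

In this model, two 4KB writes at offsets 0 and 4096 both succeed when issued in order. If the one at 4096 reaches the zone first, it fails, even though it passed the ki_pos check at submission time.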


> +static struct dentry *zonefs_create_inode(struct dentry *parent,
> +					const char *name, struct blk_zone *zone)
> +{
> +	struct inode *dir = d_inode(parent);
> +	struct dentry *dentry;
> +	struct inode *inode;
> +
> +	dentry = d_alloc_name(parent, name);
> +	if (!dentry)
> +		return NULL;
> +
> +	inode = new_inode(parent->d_sb);
> +	if (!inode)
> +		goto out;
> +
> +	inode->i_ino = get_next_ino();

get_next_ino() doesn't guarantee inode number uniqueness (it's 32
bit and global across all filesystems so it can overflow). Are
duplicate inode numbers on this superblock an issue?

> +/*
> + * Read super block information from the device.
> + */
> +static int zonefs_read_super(struct super_block *sb)
> +{
> +	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
> +	struct zonefs_super *super;
> +	u32 crc, stored_crc;
> +	struct page *page;
> +	struct bio_vec bio_vec;
> +	struct bio bio;
> +	int ret;
> +
> +	page = alloc_page(GFP_KERNEL);
> +	if (!page)
> +		return -ENOMEM;
> +
> +	bio_init(&bio, &bio_vec, 1);
> +	bio.bi_iter.bi_sector = 0;
> +	bio_set_dev(&bio, sb->s_bdev);
> +	bio_set_op_attrs(&bio, REQ_OP_READ, 0);
> +	bio_add_page(&bio, page, PAGE_SIZE, 0);
> +
> +	ret = submit_bio_wait(&bio);
> +	if (ret)
> +		goto out;
> +
> +	super = page_address(page);
> +
> +	stored_crc = le32_to_cpu(super->s_crc);
> +	super->s_crc = 0;
> +	crc = crc32(~0U, (unsigned char *)super, sizeof(struct zonefs_super));
> +	if (crc != stored_crc) {
> +		zonefs_err(sb, "Invalid checksum (Expected 0x%08x, got 0x%08x)",
> +			   crc, stored_crc);
> +		ret = -EIO;
> +		goto out;
> +	}

Does this mean if mount or the kernel tries to autoprobe the
filesystem type on a device it will get -EIO and an "Invalid
checksum" error message rather than just silently returning -EINVAL
because....

> +	ret = -EINVAL;
> +	if (le32_to_cpu(super->s_magic) != ZONEFS_MAGIC)
> +		goto out;

... it isn't actually a zonefs filesystem?

i.e. shouldn't these checks be the other way around?
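
A minimal user-space sketch of the ordering I mean. The reduced struct, the model_* helpers and the bitwise CRC are stand-ins for illustration only (host-endian, and not claimed to match the kernel crc32() bit for bit):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_ZONEFS_MAGIC	0x5a4f4653u	/* ZONEFS_MAGIC from the patch */

/* Reduced, host-endian stand-in for struct zonefs_super. */
struct model_super {
	uint32_t s_magic;
	uint32_t s_crc;
	uint8_t  s_payload[56];
};

/* Plain bitwise CRC-32 (reflected polynomial 0xEDB88320), illustration only. */
static uint32_t model_crc32(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
	}
	return crc;
}

/* Checksum of the super block computed with s_crc treated as zero. */
static uint32_t model_super_crc(const struct model_super *super)
{
	struct model_super copy = *super;

	copy.s_crc = 0;
	return model_crc32(~0u, &copy, sizeof(copy));
}

/*
 * Check the magic first so that a foreign device is rejected quietly with
 * -EINVAL; only a block that claims to be zonefs can fail with -EIO.
 */
static int model_check_super(const struct model_super *super)
{
	assert(super);

	if (super->s_magic != MODEL_ZONEFS_MAGIC)
		return -EINVAL;
	if (model_super_crc(super) != super->s_crc)
		return -EIO;
	return 0;
}
```

With this order, the "Invalid checksum" message would only ever be printed for a device that actually carries the zonefs magic.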

Cheers,

Dave.
Damien Le Moal Jan. 22, 2020, 10:07 a.m. UTC | #3
Dave,

On Wed, 2020-01-22 at 12:57 +1100, Dave Chinner wrote:
> [...]
> > +static int zonefs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> > +			      unsigned int flags, struct iomap *iomap,
> > +			      struct iomap *srcmap)
> > +{
> > +	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
> > +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > +	loff_t max_isize = zi->i_max_size;
> > +	loff_t isize;
> > +
> > +	/*
> > +	 * For sequential zones, enforce direct IO writes. This is already
> > +	 * checked when writes are issued, so warn about this here if we
> > +	 * get buffered write to a sequential file inode.
> > +	 */
> > +	if (WARN_ON_ONCE(zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
> > +			 (flags & IOMAP_WRITE) && !(flags & IOMAP_DIRECT)))
> > +		return -EIO;
> > +
> > +	/*
> > +	 * For all zones, all blocks are always mapped. For sequential zones,
> > +	 * all blocks after the write pointer (inode size) are always unwritten.
> > +	 */
> > +	mutex_lock(&zi->i_truncate_mutex);
> > +	isize = i_size_read(inode);
> > +	if (offset >= isize) {
> > +		length = min(length, max_isize - offset);
> > +		if (zi->i_ztype == ZONEFS_ZTYPE_CNV)
> > +			iomap->type = IOMAP_MAPPED;
> > +		else
> > +			iomap->type = IOMAP_UNWRITTEN;
> > +	} else {
> > +		length = min(length, isize - offset);
> > +		iomap->type = IOMAP_MAPPED;
> > +	}
> 
> Something was bugging me about this, and reading the rest of the
> patch it finally triggered. For conventional zones, inode->i_size =
> zi->i_max_size, and so if offset >= isize for a conventional
> zone then this:
> 
> 	length = min(length, max_isize - offset);
> 
> is going to result in length <= 0 and we return a negative length
> iomap.
> 
> IOWs, this case should only trigger for IO into sequential zones,
> as it appears to be prevented at higher layers for conventional
> zones by explicit checks against i_max_size and/or
> iov_iter_truncate() calls to ensure user IOs are limited to within
> i_max_size.
> 
> Hence it looks to me that triggering the (offset >= isize) case here
> for conventional zones is a WARN_ON_ONCE() and return -EIO
> situation...
> 
> So, perhaps:
> 
> 	isize = i_size_read(inode);
> 	if (offset >= isize) {
> 		if (WARN_ON_ONCE(zi->i_ztype == ZONEFS_ZTYPE_CNV)) {
> 			mutex_unlock(&zi->i_truncate_mutex);
> 			return -EIO;
> 		}
> 		length = min(length, max_isize - offset);
> 		iomap->type = IOMAP_UNWRITTEN;
> 	} else {
> 		length = min(length, isize - offset);
> 		iomap->type = IOMAP_MAPPED;
> 	}

Yes, that is much better indeed. I will change this.

> This also seems tailored around the call from zonefs_map_blocks()
> which tries to map the entire zone (length = zi->i_max_size) for
> writeback mappings. Hence the length in this case always requires
> clamping to zi->i_max_size - offset. Again, there's an issue here:
> 
> > +static int zonefs_map_blocks(struct iomap_writepage_ctx *wpc,
> > +			     struct inode *inode, loff_t offset)
> > +{
> > +	if (offset >= wpc->iomap.offset &&
> > +	    offset < wpc->iomap.offset + wpc->iomap.length)
> > +		return 0;
> > +
> > +	memset(&wpc->iomap, 0, sizeof(wpc->iomap));
> > +	return zonefs_iomap_begin(inode, offset, ZONEFS_I(inode)->i_max_size,
> > +				  0, &wpc->iomap, NULL);
> 
> Where we pass flags = 0 into zonefs_iomap_begin(), and so there is
> no checking that this writeback code path is only executing against
> a conventional zone. I.e. the comments and checks in
> zonefs_iomap_begin() relate only to user IO call paths, but don't
> validate or comment on the writeback path callers, and there's no
> comments or checks here that the inode points at a conventional
> zone, either....

I do not understand your point here. Since all blocks are always
allocated for both conventional and sequential files, I think that
using i_max_size for calling zonefs_iomap_begin() is OK: for conventional
zone files, any of these blocks can be written, either with direct IO or
through the page cache, so I do not think any distinction is necessary.
For sequential zone files, only the blocks at "offset" can be written, and
that value must be equal to zi->i_wpoffset (which accounts for in-flight
writes). In both cases, exceeding the max file size is not allowed, so
this check is common in zonefs_iomap_begin() to cover all users and not
just zonefs_map_blocks(). Did I get something wrong about how iomap works?

> 
> > +static vm_fault_t zonefs_filemap_fault(struct vm_fault *vmf)
> > +{
> > +	struct zonefs_inode_info *zi = ZONEFS_I(file_inode(vmf->vma->vm_file));
> > +	vm_fault_t ret;
> > +
> > +	down_read(&zi->i_mmap_sem);
> > +	ret = filemap_fault(vmf);
> > +	up_read(&zi->i_mmap_sem);
> > +
> > +	return ret;
> > +}
> > +
> > +static vm_fault_t zonefs_filemap_page_mkwrite(struct vm_fault *vmf)
> > +{
> > +	struct inode *inode = file_inode(vmf->vma->vm_file);
> > +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > +	vm_fault_t ret;
> > +
> > +	sb_start_pagefault(inode->i_sb);
> > +	file_update_time(vmf->vma->vm_file);
> > +
> > +	/* Serialize against truncates */
> > +	down_read(&zi->i_mmap_sem);
> > +	ret = iomap_page_mkwrite(vmf, &zonefs_iomap_ops);
> > +	up_read(&zi->i_mmap_sem);
> > +
> > +	sb_end_pagefault(inode->i_sb);
> > +	return ret;
> > +}
> 
> Should there be a WARN_ON_ONCE(zi->i_ztype != ZONEFS_ZTYPE_CNV) in
> here?

Yes, that would be useful. I will add that.

> 
> > +static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
> > +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > +	loff_t max_pos;
> > +	size_t count;
> > +	ssize_t ret;
> > +
> > +	if (iocb->ki_pos >= zi->i_max_size)
> > +		return 0;
> > +
> > +	if (iocb->ki_flags & IOCB_NOWAIT) {
> > +		if (!inode_trylock_shared(inode))
> > +			return -EAGAIN;
> > +	} else {
> > +		inode_lock_shared(inode);
> > +	}
> 
> We should really turn that into a generic helper. This pattern is
> being replicated all over the place. Not in this patchset, though...

Yes, I saw that too. Will work on something later.

> > +static int zonefs_report_zones_err_cb(struct blk_zone *zone, unsigned int idx,
> > +				      void *data)
> > +{
> > +	struct inode *inode = data;
> > +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > +	loff_t pos;
> > +
> > +	/*
> > +	 * The condition of the zone may have changed. Check it and adjust the
> > +	 * inode information as needed, similarly to zonefs_init_file_inode().
> > +	 */
> > +	if (zone->cond == BLK_ZONE_COND_OFFLINE) {
> > +		inode->i_flags |= S_IMMUTABLE;
> > +		inode->i_mode &= ~0777;
> > +		zone->wp = zone->start;
> > +	} else if (zone->cond == BLK_ZONE_COND_READONLY) {
> > +		inode->i_flags |= S_IMMUTABLE;
> > +		inode->i_mode &= ~0222;
> > +	}
> 
> This exact code is repeated in zonefs_init_file_inode(). Maybe it
> should be a helper function?

Yes, good idea.

> 
> > +
> > +	pos = (zone->wp - zone->start) << SECTOR_SHIFT;
> > +	zi->i_wpoffset = pos;
> > +	if (i_size_read(inode) != pos) {
> > +		zonefs_update_stats(inode, pos);
> > +		i_size_write(inode, pos);
> > +	}
> 
> What happens if this decreases the file size? Don't we need
> to invalidate the page cache beyond the new EOF in this case (i.e.
> it's a truncate operation)?

This is called only for direct write errors into sequential zones.
Since for that case we only deal with append direct writes, there is no
possibility of having any of the written data cached already. So even
if we get a short write or complete failure, no invalidation is needed.

Compared to errors for read operations in any zone, or conventional
zone file read/write errors, this error handling adds processing of
zone condition changes (errors due to a zone going offline or
read-only). I could add the same treatment for all IO errors. I did not,
since if we start seeing these zone conditions, it is likely that the
drive is about to die. So conventional zone writes and all read errors
are treated like on any other FS: the error is only returned to the user,
without any drive-specific forensics.

> 
> > +static int zonefs_seq_file_write_failed(struct inode *inode, int error)
> > +{
> > +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > +	struct super_block *sb = inode->i_sb;
> > +	sector_t sector = zi->i_zsector;
> > +	unsigned int nofs_flag;
> > +	int ret;
> > +
> > +	zonefs_warn(sb, "Updating inode zone %llu info\n", sector);
> > +
> > +	/*
> > +	 * blkdev_report_zones() uses GFP_KERNEL by default. Force execution as
> > +	 * if GFP_NOFS was specified so that it will not end up recursing into
> > +	 * the FS on memory allocation.
> > +	 */
> > +	nofs_flag = memalloc_nofs_save();
> > +	ret = blkdev_report_zones(sb->s_bdev, sector, 1,
> > +				  zonefs_report_zones_err_cb, inode);
> > +	memalloc_nofs_restore(nofs_flag);
> 
> The comment is kinda redundant - it's explaining exactly what the
> code does rather than why it needs this protection. i.e. the comment
> should explain the recursion vector/deadlock that we are avoiding
> here...

Yes. Changed it to:

/*
 * Report zones memory allocation could trigger a recursion into zonefs
 * due to memory reclaim. Since this is always called with the inode
 * truncate mutex lock being held, avoid the potential recursion
 * deadlock using a GFP_NOFS allocation.
 */

> 
> > +static int zonefs_file_dio_write_end(struct kiocb *iocb, ssize_t size, int ret,
> > +				     unsigned int flags)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > +
> > +	if (ret)
> > +		return ret;
> > +
> > +	/*
> > +	 * Conventional zone file size is fixed to the zone size so there
> > +	 * is no need to do anything.
> > +	 */
> > +	if (zi->i_ztype == ZONEFS_ZTYPE_CNV)
> > +		return 0;
> > +
> > +	mutex_lock(&zi->i_truncate_mutex);
> > +
> > +	if (size < 0) {
> > +		ret = zonefs_seq_file_write_failed(inode, size);
> 
> Ok, so I see it is being called from IO completion context, which
> means we'd want memalloc_noio_save() because the underlying bio
> doesn't get freed until this whole completion runs, right?

Yes, the failed BIO is freed only after the report zone is done. But
more than GFP_NOIO, we want GFP_NOFS for the reason stated above.

> 
> > +static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
> > +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > +	size_t count;
> > +	ssize_t ret;
> > +
> > +	if (iocb->ki_flags & IOCB_NOWAIT) {
> > +		if (!inode_trylock(inode))
> > +			return -EAGAIN;
> > +	} else {
> > +		inode_lock(inode);
> > +	}
> > +
> > +	ret = generic_write_checks(iocb, from);
> > +	if (ret <= 0)
> > +		goto out;
> > +
> > +	iov_iter_truncate(from, zi->i_max_size - iocb->ki_pos);
> > +	count = iov_iter_count(from);
> 
> So count is initialised to the entire IO length....

Well, yes, count reflects the truncated iov_iter size. This is
necessary for the AIO case when iomap_dio_rw() returns -EIOCBQUEUED so
that we can account for the in-flight AIOs for any subsequent
AIO submission by the user (see next comment below). For sync writes
(or AIOs that completed very quickly), the final value for count is
updated using the iomap_dio_rw() return value.

> > +
> > +	/*
> > +	 * Direct writes must be aligned to the block size, that is, the device
> > +	 * physical sector size, to avoid errors when writing sequential zones
> > +	 * on 512e devices (512B logical sector, 4KB physical sectors).
> > +	 */
> > +	if ((iocb->ki_pos | count) & sbi->s_blocksize_mask) {
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * Enforce sequential writes (append only) in sequential zones.
> > +	 */
> > +	mutex_lock(&zi->i_truncate_mutex);
> > +	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
> > +	    iocb->ki_pos != zi->i_wpoffset) {
> > +		zonefs_err(inode->i_sb,
> > +			   "Unaligned write at %llu + %zu (wp %llu)\n",
> > +			   iocb->ki_pos, count,
> > +			   zi->i_wpoffset);
> > +		mutex_unlock(&zi->i_truncate_mutex);
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +	mutex_unlock(&zi->i_truncate_mutex);
> > +
> > +	ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops, &zonefs_dio_ops,
> > +			   is_sync_kiocb(iocb));
> > +	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
> > +	    (ret > 0 || ret == -EIOCBQUEUED)) {
> > +		if (ret > 0)
> > +			count = ret;
> > +		mutex_lock(&zi->i_truncate_mutex);
> > +		zi->i_wpoffset += count;
> > +		mutex_unlock(&zi->i_truncate_mutex);
> 
> Hmmmm. This looks problematic w.r.t. AIO. If we get -EIOCBQUEUED it
> means the IO has been queued but not necessarily submitted, but
> we update zi->i_wpoffset as though the entire AIO has already
> completed. And then we drop the inode_lock() and return, allowing
> another AIO+DIO to be started.
> 
> Hence another concurrent sequential AIO+DIO write could now be
> submitted and pass the above iocb->ki_pos != zi->i_wpoffset check.
> Now we have two independent IOs in flight - one that is at the
> current hardware write pointer offset, and one that is beyond it.
> 
> What happens if the block layer now re-orders these two IOs?

If the correct block scheduler is used, that is mq-deadline, there is
no possibility of write reordering. mq-deadline is now the default IO
scheduler for zoned block devices and the only one that is allowed
(besides "none"). mq-deadline uses a zone write locking mechanism to
ensure that there is no reordering of write requests, either by the
block layer itself or by bad hardware (SATA AHCI adapters are
notoriously bad and silently reorder requests all the time, even for
SMR disks).

With this mechanism, the user can safely use io_submit() beyond a
single IO and zonefs checks that the set of AIOs being submitted are all
sequential starting from the zi->i_wpoffset "soft" write pointer that
reflects the already in-flight AIOs. Multiple io_submit() of multiple
AIOs can be executed in sequence without needing to limit to a single
AIO at a time.

If a disk error occurs along the way, the seq file size and
zi->i_wpoffset are updated using the report zone result. All in-flight or
submitted AIOs after the failed one will be failed by the disk itself
due to their now unaligned position. These failures will not change
the file size or zi->i_wpoffset again since the zone information will
be the same after all failures. The user only has to look at the file
size again to know where to restart writing from, without even needing to
wait for all in-flight AIOs to complete with an error (but that would of
course be the recommended practice).

In other words, we assume here that all writes succeed and allow high
queue depth submission using zi->i_wpoffset as a "soft" write pointer.
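
To make the accounting concrete, here is a minimal user-space sketch of the
"soft" write pointer behavior described above. It is only a model: the
struct and the model_*() helpers are hypothetical names and are not part of
the patch, and no real I/O or locking is involved.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical user-space model of a sequential zone file. */
struct seq_file_model {
	uint64_t wpoffset;	/* "soft" write pointer, includes in-flight writes */
	uint64_t max_size;	/* zone capacity in bytes */
};

/*
 * Model of the submission-time check: a write is accepted only if it
 * starts exactly at the soft write pointer. On acceptance the pointer
 * advances immediately, before the write completes.
 * Returns the number of bytes accepted, or -EINVAL.
 */
static int64_t model_submit(struct seq_file_model *f, uint64_t pos,
			    uint64_t count)
{
	assert(f->wpoffset <= f->max_size);

	if (pos != f->wpoffset)
		return -EINVAL;

	/* Same effect as iov_iter_truncate(): never go past the zone end. */
	if (count > f->max_size - pos)
		count = f->max_size - pos;

	f->wpoffset += count;
	return (int64_t)count;
}

/*
 * Model of the error path: the soft write pointer is resynchronized
 * with the position reported by the device.
 */
static void model_write_failed(struct seq_file_model *f, uint64_t device_wp)
{
	f->wpoffset = device_wp;
}
```

With this model, two back-to-back submissions at offsets 0 and 4096 are both
accepted, a third one that repeats offset 4096 is rejected with -EINVAL, and
after model_write_failed() the next accepted offset is whatever the device
reported.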

> > +static struct dentry *zonefs_create_inode(struct dentry *parent,
> > +					const char *name, struct blk_zone *zone)
> > +{
> > +	struct inode *dir = d_inode(parent);
> > +	struct dentry *dentry;
> > +	struct inode *inode;
> > +
> > +	dentry = d_alloc_name(parent, name);
> > +	if (!dentry)
> > +		return NULL;
> > +
> > +	inode = new_inode(parent->d_sb);
> > +	if (!inode)
> > +		goto out;
> > +
> > +	inode->i_ino = get_next_ino();
> 
> get_next_ino() doesn't guarantee inode number uniqueness (it's 32
> bit and global across all filesystems so it can overflow). Are
> duplicate inode numbers on this superblock an issue?

Haa. Indeed. I missed that point. It would be nicer to have unique and
mount persistent per-volume inode numbers, especially considering the
number of inodes that a large server with hundreds of drives would
generate. It is trivial to create unique inode numbers using each file
zone number on the disk (ordered by start sector) and using numbers
after that for the root directory and per zone type sub-directories.
That makes the inode numbers persistent across remounts. I will change
the code to do that. Having the file inode numbers linked to the
on-disk zones will also be useful information for debugging any problem.
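
A rough user-space sketch of that numbering scheme, just to illustrate the
idea. All names (demo_zone_no(), demo_file_ino(), demo_dir_ino(), enum
demo_dir) are hypothetical, and the exact order of the directory numbers
after the zone numbers is an assumption, not what the updated patch will
necessarily do.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ordering of the directories placed after the zone files. */
enum demo_dir {
	DEMO_DIR_ROOT = 0,
	DEMO_DIR_CNV = 1,
	DEMO_DIR_SEQ = 2,
};

/* Zone number of a zone, given its start sector and a fixed zone size. */
static uint64_t demo_zone_no(uint64_t start_sector, uint64_t zone_sectors)
{
	assert(zone_sectors > 0);
	return start_sector / zone_sectors;
}

/* Inode number of the file backed by zone number zone_no. */
static uint64_t demo_file_ino(uint64_t nr_zones, uint64_t zone_no)
{
	assert(zone_no < nr_zones);
	return zone_no;
}

/* Directories use the numbers following the last zone number. */
static uint64_t demo_dir_ino(uint64_t nr_zones, enum demo_dir dir)
{
	return nr_zones + (uint64_t)dir;
}
```

Since the zone layout of a device does not change, these numbers are the same
on every mount. A real implementation may need to shift the range if inode
number 0 has to be avoided.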

> 
> > +/*
> > + * Read super block information from the device.
> > + */
> > +static int zonefs_read_super(struct super_block *sb)
> > +{
> > +	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
> > +	struct zonefs_super *super;
> > +	u32 crc, stored_crc;
> > +	struct page *page;
> > +	struct bio_vec bio_vec;
> > +	struct bio bio;
> > +	int ret;
> > +
> > +	page = alloc_page(GFP_KERNEL);
> > +	if (!page)
> > +		return -ENOMEM;
> > +
> > +	bio_init(&bio, &bio_vec, 1);
> > +	bio.bi_iter.bi_sector = 0;
> > +	bio_set_dev(&bio, sb->s_bdev);
> > +	bio_set_op_attrs(&bio, REQ_OP_READ, 0);
> > +	bio_add_page(&bio, page, PAGE_SIZE, 0);
> > +
> > +	ret = submit_bio_wait(&bio);
> > +	if (ret)
> > +		goto out;
> > +
> > +	super = page_address(page);
> > +
> > +	stored_crc = le32_to_cpu(super->s_crc);
> > +	super->s_crc = 0;
> > +	crc = crc32(~0U, (unsigned char *)super, sizeof(struct zonefs_super));
> > +	if (crc != stored_crc) {
> > +		zonefs_err(sb, "Invalid checksum (Expected 0x%08x, got 0x%08x)",
> > +			   crc, stored_crc);
> > +		ret = -EIO;
> > +		goto out;
> > +	}
> 
> Does this mean if mount or the kernel tries to autoprobe the
> filesystem type on a device it will get -EIO and an "Invalid
> checksum" error message rather than just silently returning -EINVAL
> because....
> 
> > +	ret = -EINVAL;
> > +	if (le32_to_cpu(super->s_magic) != ZONEFS_MAGIC)
> > +		goto out;
> 
> ... it isn't actually a zonefs filesystem?
> 
> i.e. shouldn't these checks be the other way around?

Good catch. Yes, the other way around definitely makes more sense.
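
For reference, a small user-space sketch of the reordered checks: magic
first (quiet -EINVAL for a foreign device), checksum second (-EIO for a
damaged zonefs super block). The demo_* names, the struct layout and the
DEMO_MAGIC value are hypothetical, and endianness conversion is left out for
brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAGIC 0x44454d4fU	/* hypothetical value, not the zonefs magic */

/* Hypothetical 64-byte super block with no padding. */
struct demo_super {
	uint32_t s_magic;
	uint32_t s_crc;
	uint8_t s_payload[56];
};

/* Bitwise CRC32 (reflected polynomial 0xEDB88320), no final inversion. */
static uint32_t demo_crc32(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0xEDB88320U & -(crc & 1U));
	}
	return crc;
}

/* Store the checksum of the super block, computed with s_crc zeroed. */
static void demo_seal_super(struct demo_super *super)
{
	super->s_crc = 0;
	super->s_crc = demo_crc32(~0U, (const unsigned char *)super,
				  sizeof(*super));
}

/* Returns 0 if valid, -EINVAL if not ours, -EIO if ours but corrupted. */
static int demo_check_super(struct demo_super *super)
{
	uint32_t stored_crc, crc;

	assert(super);

	/* Magic first: a device holding something else is rejected quietly. */
	if (super->s_magic != DEMO_MAGIC)
		return -EINVAL;

	stored_crc = super->s_crc;
	super->s_crc = 0;
	crc = demo_crc32(~0U, (const unsigned char *)super, sizeof(*super));
	super->s_crc = stored_crc;

	return crc == stored_crc ? 0 : -EIO;
}
```

With this order, probing a device that carries no zonefs magic never reaches
the checksum test, so no "Invalid checksum" message would be printed for it.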

Thank you for all your comments. Posting an update asap.

Best regards.
Dave Chinner Jan. 22, 2020, 11:11 p.m. UTC | #4
On Wed, Jan 22, 2020 at 10:07:07AM +0000, Damien Le Moal wrote:
> Dave,
> 
> On Wed, 2020-01-22 at 12:57 +1100, Dave Chinner wrote:
> > [...]
> > > +static int zonefs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> > > +			      unsigned int flags, struct iomap *iomap,
> > > +			      struct iomap *srcmap)
> > > +{
> > > +	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
> > > +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > > +	loff_t max_isize = zi->i_max_size;
> > > +	loff_t isize;
> > > +
> > > +	/*
> > > +	 * For sequential zones, enforce direct IO writes. This is already
> > > +	 * checked when writes are issued, so warn about this here if we
> > > +	 * get buffered write to a sequential file inode.
> > > +	 */
> > > +	if (WARN_ON_ONCE(zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
> > > +			 (flags & IOMAP_WRITE) && !(flags & IOMAP_DIRECT)))
> > > +		return -EIO;
> > > +
> > > +	/*
> > > +	 * For all zones, all blocks are always mapped. For sequential zones,
> > > +	 * all blocks after the write pointer (inode size) are always unwritten.
> > > +	 */
> > > +	mutex_lock(&zi->i_truncate_mutex);
> > > +	isize = i_size_read(inode);
> > > +	if (offset >= isize) {
> > > +		length = min(length, max_isize - offset);
> > > +		if (zi->i_ztype == ZONEFS_ZTYPE_CNV)
> > > +			iomap->type = IOMAP_MAPPED;
> > > +		else
> > > +			iomap->type = IOMAP_UNWRITTEN;
> > > +	} else {
> > > +		length = min(length, isize - offset);
> > > +		iomap->type = IOMAP_MAPPED;
> > > +	}
> > 
> > Something was bugging me about this, and reading the rest of the
> > patch it finally triggered. For conventional zones, inode->i_size =
> > zi->i_max_size, and so if offset >= isize for a conventional
> > zone then this:
> > 
> > 	length = min(length, max_isize - offset);
> > 
> > is going to result in length <= 0 and we return a negative length
> > iomap.
> > 
> > IOWs, this case should only trigger for IO into sequential zones,
> > as it appears to be prevented at higher layers for conventional
> > zones by explicit checks against i_max_size and/or
> > iov_iter_truncate() calls to ensure user IOs are limited to within
> > i_max_size.
> > 
> > Hence it looks to me that triggering the (offset >= isize) case here
> > for conventional zones is a WARN_ON_ONCE() and return -EIO
> > situation...
> > 
> > SO, perhaps:
> > 
> > 	isize = i_size_read(inode);
> > 	if (offset >= isize) {
> > 		if (WARN_ON_ONCE(i->i_ztype == ZONEFS_ZTYPE_CNV)) {
> > 			/* drop locks */
> > 			return -EIO;
> > 		}
> > 		length = min(length, max_isize - offset);
> > 		iomap->type = IOMAP_UNWRITTEN;
> > 	} else {
> > 		length = min(length, isize - offset);
> > 		iomap->type = IOMAP_MAPPED;
> > 	}
> 
> Yes, that is much better indeed. I will change this.
> 
> > This also seems tailored around the call from zonefs_map_blocks()
> > which tries to map the entire zone (length = zi->i_max_size) for
> > writeback mappings. Hence the length in this case always requires
> > clamping to zi->i_max_size - offset. Again, there's an issue here:
> > 
> > > +static int zonefs_map_blocks(struct iomap_writepage_ctx *wpc,
> > > +			     struct inode *inode, loff_t offset)
> > > +{
> > > +	if (offset >= wpc->iomap.offset &&
> > > +	    offset < wpc->iomap.offset + wpc->iomap.length)
> > > +		return 0;
> > > +
> > > +	memset(&wpc->iomap, 0, sizeof(wpc->iomap));
> > > +	return zonefs_iomap_begin(inode, offset, ZONEFS_I(inode)->i_max_size,
> > > +				  0, &wpc->iomap, NULL);
> > 
> > Where we pass flags = 0 into zonefs_iomap_begin(), and so there is
> > no checking that this writeback code path is only executing against
> > a conventional zone. I.e. the comments and checks in
> > zonefs_iomap_begin() relate only to user IO call paths, but don't
> > validate or comment on the writeback path callers, and there's no
> > comments or checks here that the inode points at a conventional
> > zone, either....
> 
> I do not understand your point here. Since all blocks are always
> allocated for both conventional and sequential files, I think that
> using i_max_size for calling zonefs_iomap_begin is OK:

Yes, it is, but that wasn't really the point I was trying to make.
My comments around passing in the max size here means that
zonefs_iomap_begin() has to do clamping (i.e. the length = min(length,
max_isize - offset) calls) just for this caller, as all the other
callers from the user IO path already have their offset/lengths
clamped to isize/max_isize. Hence if zonefs_map_blocks() clamped
the length it passed to (i_max_size - offset) like all other callers
do, then code in zonefs_iomap_begin() would be simpler.


> for conventional
> zone files, any of these blocks can be written, both user direct or
> through the page cache. No distinction is I think necessary. For
> sequential zone files, only the blocks at "offset" can be written, and
> that value must be equal to zi->i_wpoffset (which account for in-
> flights writes). In both cases, exceeding the max file size is not
> allowed so this check is common in zonefs_iomap_begin() to cover all
> users and not just zonefs_map_blocks(). Did I get something wrong with
> iomap workings ?

No, the point I was trying to make (unsuccessfully!) is that
zonefs_map_blocks() is only called from buffered writeback, and
so it is only called on conventional zone writes. Neither
zonefs_map_blocks() nor zonefs_iomap_begin() check this, and
zonefs_iomap_begin() can't because it doesn't have any flags passed
into it to tell it that a mapping for a write is being done.

i.e. somewhere in this zonefs_map_blocks() codepath there needs to
be a check like:

	WARN_ON_ONCE(zi->i_ztype != ZONEFS_ZTYPE_CNV);

because we should never get here for sequential zones.

And that then raises the question - if we can only get here for
conventional zones, then wouldn't the code read better using
the inode->i_size rather than the zi->i_max_size as all IO to
conventional zones must be within the inode size?
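
Roughly what I have in mind, written as a stand-alone user-space model rather
than real iomap code. The demo_* names are made up for illustration, and the
function only models the zone type check and the length clamping, nothing
else.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

enum demo_ztype { DEMO_ZTYPE_CNV, DEMO_ZTYPE_SEQ };

/* Hypothetical stand-in for the inode fields that matter here. */
struct demo_inode {
	enum demo_ztype ztype;
	int64_t i_size;	/* for conventional zones, equal to the zone size */
};

/*
 * Model of a writeback mapping request. Returns the length that can be
 * mapped starting at offset, or a negative errno.
 */
static int64_t demo_map_blocks(const struct demo_inode *inode, int64_t offset)
{
	assert(inode);

	/* Writeback should only ever see conventional zone files. */
	if (inode->ztype != DEMO_ZTYPE_CNV)
		return -EIO;

	/* All IO to a conventional zone lies within the inode size. */
	if (offset < 0 || offset >= inode->i_size)
		return -EIO;

	/* The caller clamps, so the mapping function does not have to. */
	return inode->i_size - offset;
}
```

In the kernel the two error cases would be WARN_ON_ONCE() conditions, since
they indicate a bug in the caller rather than a user error.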

> > > +	pos = (zone->wp - zone->start) << SECTOR_SHIFT;
> > > +	zi->i_wpoffset = pos;
> > > +	if (i_size_read(inode) != pos) {
> > > +		zonefs_update_stats(inode, pos);
> > > +		i_size_write(inode, pos);
> > > +	}
> > 
> > What happens if this decreases the size of the zone? don't we need
> > to invalidate the page cache beyond the new EOF in this case (i.e.
> > it's a truncate operation)?
> 
> This is called only for direct write errors into sequential zones.
> Since for that case we only deal with append direct writes, there is no
> possibility of having any of the written data cached already. So even
> if we get a short write or complete failure, no invalidation is needed.

Ah, there's an undocumented assumption that a write error never
resets the zone write pointer completely, but instead leaves it unchanged
from where it was prior to the write that failed. My concern is what
happens if the device decides that the error has caused the zone to
be completely lost and so resets the write pointer back to zero?

And the other concern here is what if the hardware write pointer
still moves forward and exposes stale data because the write failed?

> Compared to errors for read operations in any zone, or conventional
> zone files read/write errors, this error handling adds processing of
> zone condition changes (error due to a zone going offline or read-
> only). I could add the same treatment for all IO errors. I did not
> since if we start seeing these zone conditions, it is likely that the
> drive is about to die.

Ok, so it's not expected, but it sounds like in extreme
circumstances it can still occur, and hence we still should try to
handle such errors in a sane manner.

> So conventional zone writes and all read errors
> are treated like on any other FS: only return the error to the user
> without any drive-specific forensic done.

Sure, but they don't go through this new error path :)

> > > +	/*
> > > +	 * blkdev_report_zones() uses GFP_KERNEL by default. Force execution as
> > > +	 * if GFP_NOFS was specified so that it will not end up recursing into
> > > +	 * the FS on memory allocation.
> > > +	 */
> > > +	nofs_flag = memalloc_nofs_save();
> > > +	ret = blkdev_report_zones(sb->s_bdev, sector, 1,
> > > +				  zonefs_report_zones_err_cb, inode);
> > > +	memalloc_nofs_restore(nofs_flag);
> > 
> > The comment is kinda redundant - it's explaining exactly what the
> > code does rather than why it needs this protection. i.e. the comment
> > should explain the recursion vector/deadlock that we are avoiding
> > here...
> 
> Yes. Changed it to:
> 
> /*
>  * Report zones memory allocation could trigger a recursion into zonefs
>  * due to memory reclaim. Since this is always called with the inode
>  * truncate mutex lock being held, avoid the potential recursion
>  * deadlock using a GFP_NOFS allocation.
>  */
> 
> > 
> > > +static int zonefs_file_dio_write_end(struct kiocb *iocb, ssize_t size, int ret,
> > > +				     unsigned int flags)
> > > +{
> > > +	struct inode *inode = file_inode(iocb->ki_filp);
> > > +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > > +
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	/*
> > > +	 * Conventional zone file size is fixed to the zone size so there
> > > +	 * is no need to do anything.
> > > +	 */
> > > +	if (zi->i_ztype == ZONEFS_ZTYPE_CNV)
> > > +		return 0;
> > > +
> > > +	mutex_lock(&zi->i_truncate_mutex);
> > > +
> > > +	if (size < 0) {
> > > +		ret = zonefs_seq_file_write_failed(inode, size);
> > 
> > > Ok, so I see it is being called from IO completion context, which
> > means we'd want memalloc_noio_save() because the underlying bio
> > doesn't get freed until this whole completion runs, right?
> 
> Yes, the failed BIO is freed only after the report zone is done. But
> more than GFP_NOIO, we want GFP_NOFS for the reason stated above.

Yes, I can see that GFP_NOFS is needed to avoid truncate lock
recursion. However, we are in an IO completion routine here holding
a bio, so what I'm asking is whether reclaim recursion back into the
block layer and allocating more bios (e.g.  for swap to a
conventional zone within the same zoned block device) is safe to do
while we hold a bio from the same bioset that swap bios will be
allocated from...

i.e. doesn't this violate the forwards progress guarantee we need
for bioset mempools? i.e. we now can't free a bio if nested
allocation of a bio blocks waiting for a bio to be freed...

> > > +static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
> > > +{
> > > +	struct inode *inode = file_inode(iocb->ki_filp);
> > > +	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
> > > +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > > +	size_t count;
> > > +	ssize_t ret;
> > > +
> > > +	if (iocb->ki_flags & IOCB_NOWAIT) {
> > > +		if (!inode_trylock(inode))
> > > +			return -EAGAIN;
> > > +	} else {
> > > +		inode_lock(inode);
> > > +	}
> > > +
> > > +	ret = generic_write_checks(iocb, from);
> > > +	if (ret <= 0)
> > > +		goto out;
> > > +
> > > +	iov_iter_truncate(from, zi->i_max_size - iocb->ki_pos);
> > > +	count = iov_iter_count(from);
> > 
> > So count is initialised to the entire IO length....
> 
> Well, yes, count reflects the truncated iov_iter size. This is
> necessary for the AIO case when iomap_dio_rw() returns -EIOCBQUEUED so
> that we can account for the inflight AIOs for an eventual subsequent
> AIO submission by the user (see next comment below). For sync writes
> (or AIOs that completed very quickly), the final value for count is
> updated using iomap_dio_rw() return value.

Sure.

> > > +	/*
> > > +	 * Direct writes must be aligned to the block size, that is, the device
> > > +	 * physical sector size, to avoid errors when writing sequential zones
> > > +	 * on 512e devices (512B logical sector, 4KB physical sectors).
> > > +	 */
> > > +	if ((iocb->ki_pos | count) & sbi->s_blocksize_mask) {
> > > +		ret = -EINVAL;
> > > +		goto out;
> > > +	}
> > > +
> > > +	/*
> > > +	 * Enforce sequential writes (append only) in sequential zones.
> > > +	 */
> > > +	mutex_lock(&zi->i_truncate_mutex);
> > > +	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
> > > +	    iocb->ki_pos != zi->i_wpoffset) {
> > > +		zonefs_err(inode->i_sb,
> > > +			   "Unaligned write at %llu + %zu (wp %llu)\n",
> > > +			   iocb->ki_pos, count,
> > > +			   zi->i_wpoffset);
> > > +		mutex_unlock(&zi->i_truncate_mutex);
> > > +		ret = -EINVAL;
> > > +		goto out;
> > > +	}
> > > +	mutex_unlock(&zi->i_truncate_mutex);
> > > +
> > > +	ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops, &zonefs_dio_ops,
> > > +			   is_sync_kiocb(iocb));
> > > +	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
> > > +	    (ret > 0 || ret == -EIOCBQUEUED)) {
> > > +		if (ret > 0)
> > > +			count = ret;
> > > +		mutex_lock(&zi->i_truncate_mutex);
> > > +		zi->i_wpoffset += count;
> > > +		mutex_unlock(&zi->i_truncate_mutex);
> > 
> > Hmmmm. This looks problematic w.r.t. AIO. If we get -EIOCBQUEUED it
> > means the IO has been queued but not necessarily submitted, but
> > we update zi->i_wpoffset as though the entire AIO has already
> > completed. And then we drop the inode_lock() and return, allowing
> > another AIO+DIO to be started.
> > 
> > Hence another concurrent sequential AIO+DIO write could now be
> > submitted and pass the above iocb->ki_pos != zi->i_wpoffset check.
> > Now we have two independent IOs in flight - one that is at the
> > current hardware write pointer offset, and one that is beyond it.
> > 
> > What happens if the block layer now re-orders these two IOs?
> 
> If the correct block scheduler is used, that is mq-deadline, there is
> no possibility of write reordering.

Oh, my.

That needs a great big warning in the code. This assumes the block
layer functions in a specific manner, and there is no way to
guarantee that it does at the filesystem layer. Hence if the block
layer is subtly broken (which has happened far too many times in the
past couple of years for me to just trust it anymore) then this code
can result in spurious write failures for applications that use
AIO+DIO...

> mq-deadline is now the default IO
> scheduler for zoned block devices and the only one that is allowed
> (beside "none"). mq-deadline uses a zone write locking mechanism to
> ensure that there is no reordering of write requests, either by the
> block layer itself or by bad hardware (SATA AHCI adapters are
> notoriously bad and silently reorder requests all the time, even for
> SMR disks).
>
> With this mechanism, the user can safely use io_submit() beyond a
> single IO and zonefs checks that the set of AIOs being submitted are all
> sequential starting from the zi->i_wpoffset "soft" write pointer that
> reflects the already in-flight AIOs. Multiple io_submit() of multiple
> AIOs can be executed in sequence without needing to limit to a single
> AIO at a time.

I can see lots of potential problems with AIO on a filesystem that
assumes sequential, ordered AIO submission. e.g. RWF_NOWAIT and
submitting multiple sequential IOs at a time. First IO gets EAGAIN
because a lock is held by something else, second IO gets the lock
and now returns -EINVAL because its offset no longer matches the
write pointer because the first IO got -EAGAIN and punted back to
userspace.

Or worse, it's io_uring, and it punts that IO to a worker thread to
resubmit. Now that IO will be issued out of order to all the others,
and so userspace will see that it succeeds, but all the other IOs in
the sequential batch get -EINVAL because of IO reordering long
before the IO even gets to the block layer....
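
The RWF_NOWAIT case can be modelled in a few lines of user-space C. This is
only an illustration of the failure sequence: struct demo_zone_file and
demo_submit_nowait() are hypothetical, and the "lock" is just a flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a sequential zone file. */
struct demo_zone_file {
	uint64_t wpoffset;	/* soft write pointer */
	bool lock_busy;		/* true while another task holds the inode lock */
};

/*
 * Model of one non-blocking write submission.
 * Returns 0 on acceptance, -EAGAIN if the lock cannot be taken without
 * blocking, -EINVAL if the write does not start at the write pointer.
 */
static int demo_submit_nowait(struct demo_zone_file *f, uint64_t pos,
			      uint64_t count)
{
	assert(count > 0);

	if (f->lock_busy)
		return -EAGAIN;

	if (pos != f->wpoffset)
		return -EINVAL;

	f->wpoffset += count;
	return 0;
}
```

Submit two sequential writes while the lock is busy only for the first one:
the first returns -EAGAIN, the second then returns -EINVAL even though the
application issued both in the correct order, and the write pointer has not
moved at all.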

> If a disk error occurs along the way, the seq file size and
> zi->i_wpoffset are updated using the report zone result. All in-flight or
> submitted AIOs after the failed one will be failed by the disk itself
> due to their now unaligned position. These failures will not change
> the file size or zi->i_wpoffset again since the zone information will
> be the same after all failures. The user only has to look at the file
> size again to know where to restart writing from, without even needing to
> wait for all in-flight AIOs to complete with an error (but that would of
> course be the recommended practice).

So I'm guessing that the same failure condition will return
different errors based on where the failure is detected. e.g. EINVAL
if it's at the write submission layer and EIO if it is reported by
the hardware?

> In other words, we assume here that all writes succeed and allow high
> queue depth submission using zi->i_wpoffset as a "soft" write pointer.

I'm starting to wonder whether it is a good idea to even support AIO
on sequential zones because there are some really messy spurious
failure cases that userspace will not be able to distinguish from
real write errors. That's not a very nice API for applications to
have to deal with...

At minimum, this needs extensive documentation both for users and
for kernel filesystem people that need to maintain this code forever
more...

Cheers,

Dave.
Damien Le Moal Jan. 23, 2020, 1:21 p.m. UTC | #5
On 2020/01/23 8:11, Dave Chinner wrote:
[...]
>>> This also seems tailored around the call from zonefs_map_blocks()
>>> which tries to map the entire zone (length = zi->i_max_size) for
>>> writeback mappings. Hence the length in this case always requires
>>> clamping to zi->i_max_size - offset. Again, there's an issue here:
>>>
>>>> +static int zonefs_map_blocks(struct iomap_writepage_ctx *wpc,
>>>> +			     struct inode *inode, loff_t offset)
>>>> +{
>>>> +	if (offset >= wpc->iomap.offset &&
>>>> +	    offset < wpc->iomap.offset + wpc->iomap.length)
>>>> +		return 0;
>>>> +
>>>> +	memset(&wpc->iomap, 0, sizeof(wpc->iomap));
>>>> +	return zonefs_iomap_begin(inode, offset, ZONEFS_I(inode)->i_max_size,
>>>> +				  0, &wpc->iomap, NULL);
>>>
>>> Where we pass flags = 0 into zonefs_iomap_begin(), and so there is
>>> no checking that this writeback code path is only executing against
>>> a conventional zone. I.e. the comments and checks in
>>> zonefs_iomap_begin() relate only to user IO call paths, but don't
>>> validate or comment on the writeback path callers, and there's no
>>> comments or checks here that the inode points at a conventional
>>> zone, either....
>>
>> I do not understand your point here. Since all blocks are always
>> allocated for both conventional and sequential files, I think that
>> using i_max_size for calling zonefs_iomap_begin is OK:
> 
> Yes, it is, but that wasn't really the point I was trying to make.
> My comments around passing in the max size here means that
> zonefs_iomap_begin() has to do clamping (i.e. the length = min(length,
> max_isize - offset) calls) just for this caller, as all the other
> callers from the user IO path already have their offset/lengths
> clamped to isize/max_isize. Hence if zonefs_map_blocks() clamped
> the length it passed to (i_max_size - offset) like all other callers
> do, then code in zonefs_iomap_begin() would be simpler.

Aaah, I see it now. Indeed, that would be cleaner.

>> for conventional
>> zone files, any of these blocks can be written, both user direct or
>> through the page cache. No distinction is I think necessary. For
>> sequential zone files, only the blocks at "offset" can be written, and
>> that value must be equal to zi->i_wpoffset (which account for in-
>> flights writes). In both cases, exceeding the max file size is not
>> allowed so this check is common in zonefs_iomap_begin() to cover all
>> users and not just zonefs_map_blocks(). Did I get something wrong with
>> iomap workings ?
> 
> No, the point I was trying to make (unsuccessfully!) is that
> zonefs_map_blocks() is only called from buffered writeback, and
> so it is only called on conventional zone writes. Neither
> zonefs_map_blocks() nor zonefs_iomap_begin() check this, and
> zonefs_iomap_begin() can't because it doesn't have any flags passed
> into it to tell it that a mapping for a write is being done.
> 
> i.e. somewhere in this zonefs_map_blocks() codepath there needs to
> be a check like:
> 
> 	WARN_ON_ONCE(zi->i_ztype != ZONEFS_ZTYPE_CNV);
> 
> because we should never get here for sequential zones.
> 
> And that then raises the question - if we can only get here for
> conventional zones, then wouldn't the code read better using
> the inode->i_size rather than the zi->i_max_size as all IO to
> conventional zones must be within the inode size?

Got it. It is obvious now :) Will clean that.

>>>> +	pos = (zone->wp - zone->start) << SECTOR_SHIFT;
>>>> +	zi->i_wpoffset = pos;
>>>> +	if (i_size_read(inode) != pos) {
>>>> +		zonefs_update_stats(inode, pos);
>>>> +		i_size_write(inode, pos);
>>>> +	}
>>>
>>> What happens if this decreases the size of the zone? don't we need
>>> to invalidate the page cache beyond the new EOF in this case (i.e.
>>> it's a truncate operation)?
>>
>> This is called only for direct write errors into sequential zones.
>> Since for that case we only deal with append direct writes, there is no
>> possibility of having any of the written data cached already. So even
>> if we get a short write or complete failure, no invalidation is needed.
> 
> Ah, there's an undocumented assumption that a write error never
> resets the zone write pointer completely, but only remains unchanged
> from where it was prior to the write that failed. My concern is what
> happens if the device decides that the error has caused the zone to
> be completely lost and so resets the write pointer back to zero?

Such behavior is forbidden by the ZBC/ZAC specifications. If it happens, we
would be fighting with a drive that has serious firmware bugs. So I am not
sure if it is worth adding code for that. But at the very least, adding a
warning would be nice.

> And the other concern here is what if the hardware write pointer
> still moves forward and exposes stale data because the write failed?

The write pointer can indeed move forward even in the case of a failed
write. This typically happens if a bad sector is hit in the middle of a
write. In this case the write pointer is updated to point right after the
last written sector, so no stale data is exposed. This is mandated by the
ZBC/ZAC specifications. If the drive firmware is bad and the WP goes beyond
the last written sector, we would have no way of detecting the stale data.
But on SMR, those sectors will most likely be unreadable anyway. Movement
of the write pointer beyond the IO end though can be tested for.
I will improve the checks on error.
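
As a rough illustration of the kind of check I mean, here is a user-space
sketch that classifies the write pointer reported after a single failed
write. The demo_* names are hypothetical, offsets are in bytes relative to
the zone start, and the real code will have to deal with more than this
(zone conditions, several writes in flight).

```c
#include <assert.h>
#include <stdint.h>

enum demo_wp_check {
	DEMO_WP_OK,		/* wp within [io_start, io_end]: expected */
	DEMO_WP_MOVED_BACK,	/* wp before the failed write: not allowed */
	DEMO_WP_BEYOND_IO,	/* wp past the end of the failed write */
};

/*
 * Classify the write pointer reported by the device after a write
 * covering [io_start, io_start + io_len) failed.
 */
static enum demo_wp_check demo_check_wp(uint64_t io_start, uint64_t io_len,
					uint64_t reported_wp)
{
	uint64_t io_end = io_start + io_len;

	assert(io_end >= io_start);	/* no overflow */

	if (reported_wp < io_start)
		return DEMO_WP_MOVED_BACK;
	if (reported_wp > io_end)
		return DEMO_WP_BEYOND_IO;

	/* Partial writes land anywhere from io_start up to io_end. */
	return DEMO_WP_OK;
}
```

Anything other than DEMO_WP_OK would mean the drive is not behaving as
expected, so a warning at the very least is warranted.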

>> Compared to errors for read operations in any zone, or conventional
>> zone files read/write errors, this error handling adds processing of
>> zone condition changes (error due to a zone going offline or read-
>> only). I could add the same treatment for all IO errors. I did not
>> since if we start seeing these zone conditions, it is likely that the
>> drive is about to die.
> 
> Ok, so it's not expected, but it sounds like in extreme
> circumstances it can still occur, and hence we still should try to
> handle such errors in a sane manner.

Yes. We can improve the checks. And if an unexpected (non standard) change
is detected, at the very least throw a warning, and possibly remount
read-only. For zones going offline or read-only, that is handled already
and the file attributes changed accordingly.

>> So conventional zone writes and all read errors
>> are treated like on any other FS: only return the error to the user
>> without any drive-specific forensic done.
> 
> Sure, but they don't go through this new error path :)

No they don't, but looking some more at the code, it is not too hard to
include it for all IOs to all zone types. I will do the change.

>>>> +static int zonefs_file_dio_write_end(struct kiocb *iocb, ssize_t size, int ret,
>>>> +				     unsigned int flags)
>>>> +{
>>>> +	struct inode *inode = file_inode(iocb->ki_filp);
>>>> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
>>>> +
>>>> +	if (ret)
>>>> +		return ret;
>>>> +
>>>> +	/*
>>>> +	 * Conventional zone file size is fixed to the zone size so there
>>>> +	 * is no need to do anything.
>>>> +	 */
>>>> +	if (zi->i_ztype == ZONEFS_ZTYPE_CNV)
>>>> +		return 0;
>>>> +
>>>> +	mutex_lock(&zi->i_truncate_mutex);
>>>> +
>>>> +	if (size < 0) {
>>>> +		ret = zonefs_seq_file_write_failed(inode, size);
>>>
>>> Ok, so I see it is being called from IO completion context, which
>>> means we'd want memalloc_noio_save() because the underlying bio
>>> doesn't get freed until this whole completion runs, right?
>>
>> Yes, the failed BIO is freed only after the report zone is done. But
>> more than GFP_NOIO, we want GFP_NOFS for the reason stated above.
> 
> Yes, I can see that GFP_NOFS is needed to avoid truncate lock
> recursion. However, we are in an IO completion routine here holding
> a bio, so what I'm asking is whether reclaim recursion back into the
> block layer and allocating more bios (e.g.  for swap to a
> conventional zone within the same zoned block device) is safe to do
> while we hold a bio from the same bioset that swap bios will be
> allocated from...
> 
> i.e. doesn't this violate the forwards progress guarantee we need
> for bioset mempools? i.e. we now can't free a bio if nested
> allocation of a bio blocks waiting for a bio to be freed...

Swap does not work on zoned block devices :) But I see your point. I think
it is indeed safer to use GFP_NOIO. Will change to that.

>>>> +	/*
>>>> +	 * Direct writes must be aligned to the block size, that is, the device
>>>> +	 * physical sector size, to avoid errors when writing sequential zones
>>>> +	 * on 512e devices (512B logical sector, 4KB physical sectors).
>>>> +	 */
>>>> +	if ((iocb->ki_pos | count) & sbi->s_blocksize_mask) {
>>>> +		ret = -EINVAL;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * Enforce sequential writes (append only) in sequential zones.
>>>> +	 */
>>>> +	mutex_lock(&zi->i_truncate_mutex);
>>>> +	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
>>>> +	    iocb->ki_pos != zi->i_wpoffset) {
>>>> +		zonefs_err(inode->i_sb,
>>>> +			   "Unaligned write at %llu + %zu (wp %llu)\n",
>>>> +			   iocb->ki_pos, count,
>>>> +			   zi->i_wpoffset);
>>>> +		mutex_unlock(&zi->i_truncate_mutex);
>>>> +		ret = -EINVAL;
>>>> +		goto out;
>>>> +	}
>>>> +	mutex_unlock(&zi->i_truncate_mutex);
>>>> +
>>>> +	ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops, &zonefs_dio_ops,
>>>> +			   is_sync_kiocb(iocb));
>>>> +	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
>>>> +	    (ret > 0 || ret == -EIOCBQUEUED)) {
>>>> +		if (ret > 0)
>>>> +			count = ret;
>>>> +		mutex_lock(&zi->i_truncate_mutex);
>>>> +		zi->i_wpoffset += count;
>>>> +		mutex_unlock(&zi->i_truncate_mutex);
>>>
>>> Hmmmm. This looks problematic w.r.t. AIO. If we get -EIOCBQUEUED it
>>> means the IO has been queued but not necessarily submitted, but
>>> we update zi->i_wpoffset as though the entire AIO has already
>>> completed. And then we drop the inode_lock() and return, allowing
>>> another AIO+DIO to be started.
>>>
>>> Hence another concurrent sequential AIO+DIO write could now be
>>> submitted and pass the above iocb->ki_pos != zi->i_wpoffset check.
>>> Now we have two independent IOs in flight - one that is at the
>>> current hardware write pointer offset, and one that is beyond it.
>>>
>>> What happens if the block layer now re-orders these two IOs?
>>
>> If the correct block scheduler is used, that is mq-deadline, there is
>> no possibility of write reordering.
> 
> Oh, my.
> 
> That needs a great big warning in the code. This assumes the block
> layer functions in a specific manner, and there is no way to
> guarantee that it does at the filesystem layer. Hence if the block
> layer is subtly broken (which has happened far too many times in the
> past couple of years for me to just trust it anymore) then this code
> can result in spurious write failures for applications that use
> AIO+DIO...

Yes. Absolutely true. I will add comments explaining this point. Of note is
that the same is true not only for AIOs but also for large sync write DIOs,
as these can get split into multiple write requests that can get
reordered without the correct system setup.

>> mq-deadline is now the default IO
>> scheduler for zoned block devices and the only one that is allowed
>> (beside "none"). mq-deadline uses a zone write locking mechanism to
>> ensure that there is no reordering of write requests, either by the
>> block layer itself or by bad hardware (SATA AHCI adapters are
>> notoriously bad and silently reorder requests all the time, even for
>> SMR disks).
>>
>> With this mechanism, the user can safely use io_submit() beyond a
>> single IO and zonefs checks that the set of AIOs being submitted are all
>> sequential starting from the zi->i_wpoffset "soft" write pointer that
>> reflects the already in-flight AIOs. Multiple io_submit() of multiple
>> AIOs can be executed in sequence without needing to limit to a single
>> AIO at a time.
> 
> I can see lots of potential problems with AIO on a filesystem that
> assumes sequential, ordered AIO submission. e.g. RWF_NOWAIT and
> submitting multiple sequential IOs at a time. First IO gets EAGAIN
> because a lock is held by something else, second IO gets the lock
> and now returns -EINVAL because its offset no longer matches the
> write pointer because the first IO got -EAGAIN and punted back to
> userspace.

Ah, yes indeed. I overlooked this case. I guess we have several solutions:
(1) return an error if RWF_NOWAIT is specified, (2) ignore it and proceed
as if it is not specified, or (3) keep as is and let the user handle the
error pattern.

This error case is still safe with regard to the file size/zone WP: the
file size is kept unchanged from the value inspected for the first -EAGAIN
failed AIO and the user can restart writing from there. Not too bad. But I
am leaning toward solution (2) to reduce these potential spurious errors.
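
As an illustration only, the RWF_NOWAIT failure pattern and the effect of
solution (2) can be reproduced with a small userspace model. This is a sketch
under stated assumptions, not zonefs code: struct seq_file_model,
model_submit() and the honor_nowait switch are hypothetical names, and the
inode lock is reduced to a single boolean.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of a sequential zone file; not kernel code. */
struct seq_file_model {
	uint64_t wpoffset;	/* "soft" write pointer, like zi->i_wpoffset */
	bool lock_busy;		/* true while another context holds the lock */
};

/*
 * Model of the submission-time checks. A NOWAIT write fails with -EAGAIN
 * when the lock is busy and honor_nowait is set. Any write that does not
 * start at the soft write pointer fails with -EINVAL. On success, the soft
 * write pointer advances by len.
 */
static int model_submit(struct seq_file_model *f, uint64_t pos, uint64_t len,
			bool nowait, bool honor_nowait)
{
	if (f->lock_busy) {
		if (nowait && honor_nowait)
			return -EAGAIN;
		/* Otherwise the submitter would sleep until the lock is free. */
		f->lock_busy = false;
	}

	if (pos != f->wpoffset)
		return -EINVAL;

	f->wpoffset += len;
	return 0;
}
```

With honor_nowait set, a batch of two sequential 4KB writes submitted while
the lock is busy fails with -EAGAIN and then -EINVAL, and the soft write
pointer does not move. With honor_nowait cleared, as in solution (2), both
writes are accepted in order.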

> Or worse, it's io_uring, and it punts that IO to a worker thread to
> resubmit. Now that IO will be issued out of order to all the others,
> and so userspace will see that it succeeds, but all the other IOs in
> the sequential batch get -EINVAL because of IO reordering long
> before the IO even gets to the block layer....

Yes, same here with the difference that we would have requests being
uselessly sent to the disk and failed with EIO instead of EINVAL. But
again, this is still safe with regard to the inode size/write pointer as no
change happens.
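
Since the file size stays consistent with the zone write pointer in these
failure cases, user-space recovery can be sketched as re-reading the file
size and restarting the append from there. The helper name restart_offset()
is hypothetical, and the sketch assumes that fstat() on a sequential zone
file reports the current write pointer offset as st_size, as described in
this thread.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Return the offset at which appending should restart after a failed
 * write, or -1 if the file size cannot be read. For a sequential zone
 * file, the file size is assumed to reflect the zone write pointer.
 */
static off_t restart_offset(int fd)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return -1;

	return st.st_size;
}
```

An application would typically wait for all in-flight AIOs to complete, call
this helper, and resubmit its remaining data starting at the returned offset.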

>> If a disk error occurs along the way, the seq file size and
>> zi->i_wpoffset are updated using the report zone result. All in-flight or
>> submitted AIOs after the failed one will be failed by the disk itself
>> due to their now unaligned position. These failures will not change
>> the file size or zi->i_wpoffset again since the zone information will
>> be the same after all failures. The user only has to look at the file
>> size again to know where to restart writing from without even needing to
>> wait for all in-flight AIO to complete with an error (but that would of
>> course be the recommended practice).
> 
> So I'm guessing that the same failure condition will return
> different errors based on where the failure is detected. e.g. EINVAL
> if it's at the write submission layer and EIO if it is reported by
> the hardware?

Yes, correct. With some additional work in the block layer, we could also
return -EINVAL for all unaligned writes failed by the disk. This would
indicate a soft recoverable error rather than the generally more serious
EIO. At the drive level, ZBC and ZAC define sense codes for unaligned
writes which could be mapped to a new blk status code, which in turn can be
converted to a -EINVAL for failed unaligned write bios. That can be done
separately from zonefs though. Will look at it.
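
The proposed mapping can be sketched as follows. This is a userspace model
only: enum model_status and MODEL_STS_UNALIGNED_WRITE are hypothetical
stand-ins for a block layer status code that does not exist as of this
thread, and the decoding of ZBC/ZAC sense data is not shown.

```c
#include <errno.h>

/* Hypothetical completion statuses; not the kernel's blk_status_t values. */
enum model_status {
	MODEL_STS_OK,
	MODEL_STS_IOERR,		/* generic device error */
	MODEL_STS_UNALIGNED_WRITE,	/* assumed new status for unaligned writes */
};

/*
 * Convert a completion status to an errno value. Unaligned writes are
 * reported as -EINVAL, a recoverable error, and every other failure
 * keeps the generic -EIO.
 */
static int model_status_to_errno(enum model_status sts)
{
	switch (sts) {
	case MODEL_STS_OK:
		return 0;
	case MODEL_STS_UNALIGNED_WRITE:
		return -EINVAL;
	default:
		return -EIO;
	}
}
```

With such a mapping, an unaligned write would produce the same -EINVAL
whether it is rejected at submission or failed by the drive.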

>> In other words, we assume here that all writes succeed and allow high
>> queue depth submission using zi->i_wpoffset as a "soft" write pointer.
> 
> I'm starting to wonder whether it is a good idea to even support AIO
> on sequential zones because there are some really messy spurious
> failure cases that userspace will not be able to distinguish from
> real write errors. That's not a very nice API for applications to
> have to deal with...

Yes, but I would argue that zonefs does not make this error processing
harder than with the raw zoned block device use case. If anything, zonefs
is already an improvement over the complete lack of zone device specific
error handling of the block device file IO path. Unless I get a lot of
push-back, I would rather keep AIOs and work on suppressing bad patterns
due to RWF_NOWAIT and other corner cases.

> At minimum, this needs extensive documentation both for users and
> for kernel filesystem people that need to maintain this code forever
> more...

Yes, OK. I will add more comments and improve the documentation file
mentioning these points.

Thank you for your comments.

Best regards.

Patch
diff mbox series

diff --git a/MAINTAINERS b/MAINTAINERS
index 8982c6e013b3..2ffdeaa7191e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18304,6 +18304,15 @@  L:	linux-kernel@vger.kernel.org
 S:	Maintained
 F:	arch/x86/kernel/cpu/zhaoxin.c
 
+ZONEFS FILESYSTEM
+M:	Damien Le Moal <damien.lemoal@wdc.com>
+M:	Naohiro Aota <naohiro.aota@wdc.com>
+R:	Johannes Thumshirn <jth@kernel.org>
+L:	linux-fsdevel@vger.kernel.org
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs.git
+S:	Maintained
+F:	fs/zonefs/
+
 ZPOOL COMPRESSED PAGE STORAGE API
 M:	Dan Streetman <ddstreet@ieee.org>
 L:	linux-mm@kvack.org
diff --git a/fs/Kconfig b/fs/Kconfig
index 7b623e9fc1b0..a3f97ca2bd46 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -40,6 +40,7 @@  source "fs/ocfs2/Kconfig"
 source "fs/btrfs/Kconfig"
 source "fs/nilfs2/Kconfig"
 source "fs/f2fs/Kconfig"
+source "fs/zonefs/Kconfig"
 
 config FS_DAX
 	bool "Direct Access (DAX) support"
diff --git a/fs/Makefile b/fs/Makefile
index 1148c555c4d3..527f228a5e8a 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -133,3 +133,4 @@  obj-$(CONFIG_CEPH_FS)		+= ceph/
 obj-$(CONFIG_PSTORE)		+= pstore/
 obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
 obj-$(CONFIG_EROFS_FS)		+= erofs/
+obj-$(CONFIG_ZONEFS_FS)		+= zonefs/
diff --git a/fs/zonefs/Kconfig b/fs/zonefs/Kconfig
new file mode 100644
index 000000000000..03a4ef80f975
--- /dev/null
+++ b/fs/zonefs/Kconfig
@@ -0,0 +1,9 @@ 
+config ZONEFS_FS
+	tristate "zonefs filesystem support"
+	depends on BLOCK
+	depends on BLK_DEV_ZONED
+	help
+	  zonefs is a simple File System which exposes zones of a zoned block
+	  device (e.g. host-managed or host-aware SMR disk drives) as files.
+
+	  If unsure, say N.
diff --git a/fs/zonefs/Makefile b/fs/zonefs/Makefile
new file mode 100644
index 000000000000..75a380aa1ae1
--- /dev/null
+++ b/fs/zonefs/Makefile
@@ -0,0 +1,4 @@ 
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_ZONEFS_FS) += zonefs.o
+
+zonefs-y	:= super.o
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
new file mode 100644
index 000000000000..e466bd14df12
--- /dev/null
+++ b/fs/zonefs/super.c
@@ -0,0 +1,1178 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Simple zone file system for zoned block devices.
+ *
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ */
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/magic.h>
+#include <linux/iomap.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include <linux/statfs.h>
+#include <linux/writeback.h>
+#include <linux/quotaops.h>
+#include <linux/seq_file.h>
+#include <linux/parser.h>
+#include <linux/uio.h>
+#include <linux/mman.h>
+#include <linux/sched/mm.h>
+#include <linux/crc32.h>
+
+#include "zonefs.h"
+
+static int zonefs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
+			      unsigned int flags, struct iomap *iomap,
+			      struct iomap *srcmap)
+{
+	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	loff_t max_isize = zi->i_max_size;
+	loff_t isize;
+
+	/*
+	 * For sequential zones, enforce direct IO writes. This is already
+	 * checked when writes are issued, so warn about this here if we
+	 * get buffered write to a sequential file inode.
+	 */
+	if (WARN_ON_ONCE(zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
+			 (flags & IOMAP_WRITE) && !(flags & IOMAP_DIRECT)))
+		return -EIO;
+
+	/*
+	 * For all zones, all blocks are always mapped. For sequential zones,
+	 * all blocks after the write pointer (inode size) are always unwritten.
+	 */
+	mutex_lock(&zi->i_truncate_mutex);
+	isize = i_size_read(inode);
+	if (offset >= isize) {
+		length = min(length, max_isize - offset);
+		if (zi->i_ztype == ZONEFS_ZTYPE_CNV)
+			iomap->type = IOMAP_MAPPED;
+		else
+			iomap->type = IOMAP_UNWRITTEN;
+	} else {
+		length = min(length, isize - offset);
+		iomap->type = IOMAP_MAPPED;
+	}
+	mutex_unlock(&zi->i_truncate_mutex);
+
+	iomap->offset = offset & (~sbi->s_blocksize_mask);
+	iomap->length = ((offset + length + sbi->s_blocksize_mask) &
+			 (~sbi->s_blocksize_mask)) - iomap->offset;
+	iomap->bdev = inode->i_sb->s_bdev;
+	iomap->addr = (zi->i_zsector << SECTOR_SHIFT) + iomap->offset;
+
+	return 0;
+}
+
+static const struct iomap_ops zonefs_iomap_ops = {
+	.iomap_begin	= zonefs_iomap_begin,
+};
+
+static int zonefs_readpage(struct file *unused, struct page *page)
+{
+	return iomap_readpage(page, &zonefs_iomap_ops);
+}
+
+static int zonefs_readpages(struct file *unused, struct address_space *mapping,
+			    struct list_head *pages, unsigned int nr_pages)
+{
+	return iomap_readpages(mapping, pages, nr_pages, &zonefs_iomap_ops);
+}
+
+static int zonefs_map_blocks(struct iomap_writepage_ctx *wpc,
+			     struct inode *inode, loff_t offset)
+{
+	if (offset >= wpc->iomap.offset &&
+	    offset < wpc->iomap.offset + wpc->iomap.length)
+		return 0;
+
+	memset(&wpc->iomap, 0, sizeof(wpc->iomap));
+	return zonefs_iomap_begin(inode, offset, ZONEFS_I(inode)->i_max_size,
+				  0, &wpc->iomap, NULL);
+}
+
+static const struct iomap_writeback_ops zonefs_writeback_ops = {
+	.map_blocks		= zonefs_map_blocks,
+};
+
+static int zonefs_writepage(struct page *page, struct writeback_control *wbc)
+{
+	struct iomap_writepage_ctx wpc = { };
+
+	return iomap_writepage(page, wbc, &wpc, &zonefs_writeback_ops);
+}
+
+static int zonefs_writepages(struct address_space *mapping,
+			     struct writeback_control *wbc)
+{
+	struct iomap_writepage_ctx wpc = { };
+
+	return iomap_writepages(mapping, wbc, &wpc, &zonefs_writeback_ops);
+}
+
+static const struct address_space_operations zonefs_file_aops = {
+	.readpage		= zonefs_readpage,
+	.readpages		= zonefs_readpages,
+	.writepage		= zonefs_writepage,
+	.writepages		= zonefs_writepages,
+	.set_page_dirty		= iomap_set_page_dirty,
+	.releasepage		= iomap_releasepage,
+	.invalidatepage		= iomap_invalidatepage,
+	.migratepage		= iomap_migrate_page,
+	.is_partially_uptodate  = iomap_is_partially_uptodate,
+	.error_remove_page	= generic_error_remove_page,
+	.direct_IO		= noop_direct_IO,
+};
+
+static void zonefs_update_stats(struct inode *inode, loff_t new_isize)
+{
+	struct super_block *sb = inode->i_sb;
+	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
+	loff_t old_isize = i_size_read(inode);
+
+	if (new_isize == old_isize)
+		return;
+
+	spin_lock(&sbi->s_lock);
+
+	if (!new_isize) {
+		/* File truncated to 0 */
+		sbi->s_used_blocks -= old_isize >> sb->s_blocksize_bits;
+	} else if (new_isize > old_isize) {
+		/* File written or truncated to max size */
+		sbi->s_used_blocks +=
+			(new_isize - old_isize) >> sb->s_blocksize_bits;
+	} else {
+		/* Sequential zone files can only grow or be truncated to 0 */
+		WARN_ON(new_isize < old_isize);
+	}
+
+	spin_unlock(&sbi->s_lock);
+}
+
+static int zonefs_seq_file_truncate(struct inode *inode, loff_t isize)
+{
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	loff_t old_isize;
+	enum req_opf op;
+	int ret = 0;
+
+	/*
+	 * For sequential zone files, we can only allow truncating to 0 size,
+	 * which is equivalent to a zone reset, or to the maximum file size,
+	 * which is equivalent to a zone finish.
+	 */
+	if (!isize)
+		op = REQ_OP_ZONE_RESET;
+	else if (isize == zi->i_max_size)
+		op = REQ_OP_ZONE_FINISH;
+	else
+		return -EPERM;
+
+	inode_dio_wait(inode);
+
+	/* Serialize against page faults */
+	down_write(&zi->i_mmap_sem);
+
+	/* Serialize against zonefs_iomap_begin() */
+	mutex_lock(&zi->i_truncate_mutex);
+
+	old_isize = i_size_read(inode);
+	if (isize == old_isize)
+		goto unlock;
+
+	ret = blkdev_zone_mgmt(inode->i_sb->s_bdev, op, zi->i_zsector,
+			       zi->i_max_size >> SECTOR_SHIFT, GFP_NOFS);
+	if (ret) {
+		zonefs_err(inode->i_sb,
+			   "Zone management operation at %llu failed %d\n",
+			   zi->i_zsector, ret);
+		goto unlock;
+	}
+
+	zonefs_update_stats(inode, isize);
+	truncate_setsize(inode, isize);
+	zi->i_wpoffset = isize;
+
+unlock:
+	mutex_unlock(&zi->i_truncate_mutex);
+	up_write(&zi->i_mmap_sem);
+
+	return ret;
+}
+
+static int zonefs_inode_setattr(struct dentry *dentry, struct iattr *iattr)
+{
+	struct inode *inode = d_inode(dentry);
+	int ret;
+
+	ret = setattr_prepare(dentry, iattr);
+	if (ret)
+		return ret;
+
+	/*
+	 * Since files and directories cannot be created nor deleted, do not
+	 * allow setting any write attributes on the zone types sub-directories.
+	 */
+	if ((iattr->ia_valid & ATTR_MODE) && S_ISDIR(inode->i_mode) &&
+	    (iattr->ia_mode & 0222))
+		return -EPERM;
+
+	if (((iattr->ia_valid & ATTR_UID) &&
+	     !uid_eq(iattr->ia_uid, inode->i_uid)) ||
+	    ((iattr->ia_valid & ATTR_GID) &&
+	     !gid_eq(iattr->ia_gid, inode->i_gid))) {
+		ret = dquot_transfer(inode, iattr);
+		if (ret)
+			return ret;
+	}
+
+	if (iattr->ia_valid & ATTR_SIZE) {
+		/* The size of conventional zone files cannot be changed */
+		if (ZONEFS_I(inode)->i_ztype == ZONEFS_ZTYPE_CNV)
+			return -EPERM;
+
+		ret = zonefs_seq_file_truncate(inode, iattr->ia_size);
+		if (ret)
+			return ret;
+	}
+
+	setattr_copy(inode, iattr);
+
+	return 0;
+}
+
+static const struct inode_operations zonefs_file_inode_operations = {
+	.setattr	= zonefs_inode_setattr,
+};
+
+static int zonefs_file_fsync(struct file *file, loff_t start, loff_t end,
+			     int datasync)
+{
+	struct inode *inode = file_inode(file);
+	int ret = 0;
+
+	/*
+	 * Since only direct writes are allowed in sequential files, page cache
+	 * flush is needed only for conventional zone files.
+	 */
+	if (ZONEFS_I(inode)->i_ztype == ZONEFS_ZTYPE_CNV) {
+		ret = file_write_and_wait_range(file, start, end);
+		if (ret)
+			return ret;
+		ret = file_check_and_advance_wb_err(file);
+	}
+
+	if (ret == 0)
+		ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
+
+	return ret;
+}
+
+static vm_fault_t zonefs_filemap_fault(struct vm_fault *vmf)
+{
+	struct zonefs_inode_info *zi = ZONEFS_I(file_inode(vmf->vma->vm_file));
+	vm_fault_t ret;
+
+	down_read(&zi->i_mmap_sem);
+	ret = filemap_fault(vmf);
+	up_read(&zi->i_mmap_sem);
+
+	return ret;
+}
+
+static vm_fault_t zonefs_filemap_page_mkwrite(struct vm_fault *vmf)
+{
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	vm_fault_t ret;
+
+	sb_start_pagefault(inode->i_sb);
+	file_update_time(vmf->vma->vm_file);
+
+	/* Serialize against truncates */
+	down_read(&zi->i_mmap_sem);
+	ret = iomap_page_mkwrite(vmf, &zonefs_iomap_ops);
+	up_read(&zi->i_mmap_sem);
+
+	sb_end_pagefault(inode->i_sb);
+	return ret;
+}
+
+static const struct vm_operations_struct zonefs_file_vm_ops = {
+	.fault		= zonefs_filemap_fault,
+	.map_pages	= filemap_map_pages,
+	.page_mkwrite	= zonefs_filemap_page_mkwrite,
+};
+
+static int zonefs_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	/*
+	 * Since conventional zones accept random writes, conventional zone
+	 * files can support shared writeable mappings. For sequential zone
+	 * files, only readonly mappings are possible since there are no write
+	 * ordering guarantees due to msync() and page cache writeback.
+	 */
+	if (ZONEFS_I(file_inode(file))->i_ztype == ZONEFS_ZTYPE_SEQ &&
+	    (vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+		return -EINVAL;
+
+	file_accessed(file);
+	vma->vm_ops = &zonefs_file_vm_ops;
+
+	return 0;
+}
+
+static loff_t zonefs_file_llseek(struct file *file, loff_t offset, int whence)
+{
+	loff_t isize = i_size_read(file_inode(file));
+
+	/*
+	 * Seeks are limited to below the zone size for conventional zones
+	 * and below the zone write pointer for sequential zones. In both
+	 * cases, this limit is the inode size.
+	 */
+	return generic_file_llseek_size(file, offset, whence, isize, isize);
+}
+
+static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	loff_t max_pos;
+	size_t count;
+	ssize_t ret;
+
+	if (iocb->ki_pos >= zi->i_max_size)
+		return 0;
+
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!inode_trylock_shared(inode))
+			return -EAGAIN;
+	} else {
+		inode_lock_shared(inode);
+	}
+
+	mutex_lock(&zi->i_truncate_mutex);
+
+	/*
+	 * Limit read operations to written data.
+	 */
+	max_pos = i_size_read(inode);
+	if (iocb->ki_pos >= max_pos) {
+		mutex_unlock(&zi->i_truncate_mutex);
+		ret = 0;
+		goto out;
+	}
+
+	iov_iter_truncate(to, max_pos - iocb->ki_pos);
+
+	mutex_unlock(&zi->i_truncate_mutex);
+
+	count = iov_iter_count(to);
+
+	if (iocb->ki_flags & IOCB_DIRECT) {
+		if ((iocb->ki_pos | count) & sbi->s_blocksize_mask) {
+			ret = -EINVAL;
+			goto out;
+		}
+		file_accessed(iocb->ki_filp);
+		ret = iomap_dio_rw(iocb, to, &zonefs_iomap_ops, NULL,
+				   is_sync_kiocb(iocb));
+	} else {
+		ret = generic_file_read_iter(iocb, to);
+	}
+
+out:
+	inode_unlock_shared(inode);
+
+	return ret;
+}
+
+static int zonefs_report_zones_err_cb(struct blk_zone *zone, unsigned int idx,
+				      void *data)
+{
+	struct inode *inode = data;
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	loff_t pos;
+
+	/*
+	 * The condition of the zone may have changed. Check it and adjust the
+	 * inode information as needed, similarly to zonefs_init_file_inode().
+	 */
+	if (zone->cond == BLK_ZONE_COND_OFFLINE) {
+		inode->i_flags |= S_IMMUTABLE;
+		inode->i_mode &= ~0777;
+		zone->wp = zone->start;
+	} else if (zone->cond == BLK_ZONE_COND_READONLY) {
+		inode->i_flags |= S_IMMUTABLE;
+		inode->i_mode &= ~0222;
+	}
+
+	pos = (zone->wp - zone->start) << SECTOR_SHIFT;
+	zi->i_wpoffset = pos;
+	if (i_size_read(inode) != pos) {
+		zonefs_update_stats(inode, pos);
+		i_size_write(inode, pos);
+	}
+
+	return 0;
+}
+
+/*
+ * When a write error occurs in a sequential zone, the zone write pointer
+ * position must be refreshed to correct the file size and zonefs inode
+ * write pointer offset.
+ */
+static int zonefs_seq_file_write_failed(struct inode *inode, int error)
+{
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct super_block *sb = inode->i_sb;
+	sector_t sector = zi->i_zsector;
+	unsigned int nofs_flag;
+	int ret;
+
+	zonefs_warn(sb, "Updating inode zone %llu info\n", sector);
+
+	/*
+	 * blkdev_report_zones() uses GFP_KERNEL by default. Force execution as
+	 * if GFP_NOFS was specified so that it will not end up recursing into
+	 * the FS on memory allocation.
+	 */
+	nofs_flag = memalloc_nofs_save();
+	ret = blkdev_report_zones(sb->s_bdev, sector, 1,
+				  zonefs_report_zones_err_cb, inode);
+	memalloc_nofs_restore(nofs_flag);
+
+	if (ret != 1) {
+		if (!ret)
+			ret = -EIO;
+		zonefs_err(sb, "Get zone %llu report failed %d\n",
+			   sector, ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+static int zonefs_file_dio_write_end(struct kiocb *iocb, ssize_t size, int ret,
+				     unsigned int flags)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+
+	if (ret)
+		return ret;
+
+	/*
+	 * Conventional zone file size is fixed to the zone size so there
+	 * is no need to do anything.
+	 */
+	if (zi->i_ztype == ZONEFS_ZTYPE_CNV)
+		return 0;
+
+	mutex_lock(&zi->i_truncate_mutex);
+
+	if (size < 0) {
+		ret = zonefs_seq_file_write_failed(inode, size);
+	} else if (i_size_read(inode) < iocb->ki_pos + size) {
+		zonefs_update_stats(inode, iocb->ki_pos + size);
+		i_size_write(inode, iocb->ki_pos + size);
+	}
+
+	mutex_unlock(&zi->i_truncate_mutex);
+
+	return ret;
+}
+
+static const struct iomap_dio_ops zonefs_dio_ops = {
+	.end_io			= zonefs_file_dio_write_end,
+};
+
+static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	size_t count;
+	ssize_t ret;
+
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!inode_trylock(inode))
+			return -EAGAIN;
+	} else {
+		inode_lock(inode);
+	}
+
+	ret = generic_write_checks(iocb, from);
+	if (ret <= 0)
+		goto out;
+
+	iov_iter_truncate(from, zi->i_max_size - iocb->ki_pos);
+	count = iov_iter_count(from);
+
+	/*
+	 * Direct writes must be aligned to the block size, that is, the device
+	 * physical sector size, to avoid errors when writing sequential zones
+	 * on 512e devices (512B logical sector, 4KB physical sectors).
+	 */
+	if ((iocb->ki_pos | count) & sbi->s_blocksize_mask) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * Enforce sequential writes (append only) in sequential zones.
+	 */
+	mutex_lock(&zi->i_truncate_mutex);
+	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
+	    iocb->ki_pos != zi->i_wpoffset) {
+		zonefs_err(inode->i_sb,
+			   "Unaligned write at %llu + %zu (wp %llu)\n",
+			   iocb->ki_pos, count,
+			   zi->i_wpoffset);
+		mutex_unlock(&zi->i_truncate_mutex);
+		ret = -EINVAL;
+		goto out;
+	}
+	mutex_unlock(&zi->i_truncate_mutex);
+
+	ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops, &zonefs_dio_ops,
+			   is_sync_kiocb(iocb));
+	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
+	    (ret > 0 || ret == -EIOCBQUEUED)) {
+		if (ret > 0)
+			count = ret;
+		mutex_lock(&zi->i_truncate_mutex);
+		zi->i_wpoffset += count;
+		mutex_unlock(&zi->i_truncate_mutex);
+	}
+
+out:
+	inode_unlock(inode);
+
+	return ret;
+}
+
+static ssize_t zonefs_file_buffered_write(struct kiocb *iocb,
+					  struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	ssize_t ret;
+
+	/*
+	 * Direct IO writes are mandatory for sequential zones so that the
+	 * write IO order is preserved.
+	 */
+	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ)
+		return -EIO;
+
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!inode_trylock(inode))
+			return -EAGAIN;
+	} else {
+		inode_lock(inode);
+	}
+
+	ret = generic_write_checks(iocb, from);
+	if (ret <= 0)
+		goto out;
+
+	iov_iter_truncate(from, zi->i_max_size - iocb->ki_pos);
+
+	ret = iomap_file_buffered_write(iocb, from, &zonefs_iomap_ops);
+	if (ret > 0)
+		iocb->ki_pos += ret;
+
+out:
+	inode_unlock(inode);
+	if (ret > 0)
+		ret = generic_write_sync(iocb, ret);
+
+	return ret;
+}
+
+static ssize_t zonefs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	/*
+	 * Check that the write operation does not go beyond the zone size.
+	 */
+	if (iocb->ki_pos >= ZONEFS_I(inode)->i_max_size)
+		return -EFBIG;
+
+	if (iocb->ki_flags & IOCB_DIRECT)
+		return zonefs_file_dio_write(iocb, from);
+
+	return zonefs_file_buffered_write(iocb, from);
+}
+
+static const struct file_operations zonefs_file_operations = {
+	.open		= generic_file_open,
+	.fsync		= zonefs_file_fsync,
+	.mmap		= zonefs_file_mmap,
+	.llseek		= zonefs_file_llseek,
+	.read_iter	= zonefs_file_read_iter,
+	.write_iter	= zonefs_file_write_iter,
+	.splice_read	= generic_file_splice_read,
+	.splice_write	= iter_file_splice_write,
+	.iopoll		= iomap_dio_iopoll,
+};
+
+static struct kmem_cache *zonefs_inode_cachep;
+
+static struct inode *zonefs_alloc_inode(struct super_block *sb)
+{
+	struct zonefs_inode_info *zi;
+
+	zi = kmem_cache_alloc(zonefs_inode_cachep, GFP_KERNEL);
+	if (!zi)
+		return NULL;
+
+	inode_init_once(&zi->i_vnode);
+	mutex_init(&zi->i_truncate_mutex);
+	init_rwsem(&zi->i_mmap_sem);
+
+	return &zi->i_vnode;
+}
+
+static void zonefs_free_inode(struct inode *inode)
+{
+	kmem_cache_free(zonefs_inode_cachep, ZONEFS_I(inode));
+}
+
+/*
+ * File system stat.
+ */
+static int zonefs_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+	struct super_block *sb = dentry->d_sb;
+	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
+	enum zonefs_ztype t;
+	u64 fsid;
+
+	buf->f_type = ZONEFS_MAGIC;
+	buf->f_bsize = sb->s_blocksize;
+	buf->f_namelen = ZONEFS_NAME_MAX;
+
+	spin_lock(&sbi->s_lock);
+
+	buf->f_blocks = sbi->s_blocks;
+	if (WARN_ON(sbi->s_used_blocks > sbi->s_blocks))
+		buf->f_bfree = 0;
+	else
+		buf->f_bfree = buf->f_blocks - sbi->s_used_blocks;
+	buf->f_bavail = buf->f_bfree;
+
+	for (t = 0; t < ZONEFS_ZTYPE_MAX; t++) {
+		if (sbi->s_nr_files[t])
+			buf->f_files += sbi->s_nr_files[t] + 1;
+	}
+	buf->f_ffree = 0;
+
+	spin_unlock(&sbi->s_lock);
+
+	fsid = le64_to_cpup((void *)sbi->s_uuid.b) ^
+		le64_to_cpup((void *)sbi->s_uuid.b + sizeof(u64));
+	buf->f_fsid.val[0] = (u32)fsid;
+	buf->f_fsid.val[1] = (u32)(fsid >> 32);
+
+	return 0;
+}
+
+static const struct super_operations zonefs_sops = {
+	.alloc_inode	= zonefs_alloc_inode,
+	.free_inode	= zonefs_free_inode,
+	.statfs		= zonefs_statfs,
+};
+
+static const struct inode_operations zonefs_dir_inode_operations = {
+	.lookup		= simple_lookup,
+	.setattr	= zonefs_inode_setattr,
+};
+
+static void zonefs_init_dir_inode(struct inode *parent, struct inode *inode)
+{
+	inode_init_owner(inode, parent, S_IFDIR | 0555);
+	inode->i_op = &zonefs_dir_inode_operations;
+	inode->i_fop = &simple_dir_operations;
+	set_nlink(inode, 2);
+	inc_nlink(parent);
+}
+
+static void zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone)
+{
+	struct super_block *sb = inode->i_sb;
+	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	umode_t	perm = sbi->s_perm;
+
+	if (zone->cond == BLK_ZONE_COND_OFFLINE) {
+		/*
+		 * Dead zone: make the inode immutable, disable all accesses
+		 * and set the file size to 0.
+		 */
+		inode->i_flags |= S_IMMUTABLE;
+		zone->wp = zone->start;
+		perm &= ~0777;
+	} else if (zone->cond == BLK_ZONE_COND_READONLY) {
+		/* Do not allow writes in read-only zones */
+		inode->i_flags |= S_IMMUTABLE;
+		perm &= ~0222;
+	}
+
+	zi->i_ztype = zonefs_zone_type(zone);
+	zi->i_zsector = zone->start;
+	zi->i_max_size = min_t(loff_t, MAX_LFS_FILESIZE,
+			       zone->len << SECTOR_SHIFT);
+	if (zi->i_ztype == ZONEFS_ZTYPE_CNV)
+		zi->i_wpoffset = zi->i_max_size;
+	else
+		zi->i_wpoffset = (zone->wp - zone->start) << SECTOR_SHIFT;
+
+	inode->i_mode = S_IFREG | perm;
+	inode->i_uid = sbi->s_uid;
+	inode->i_gid = sbi->s_gid;
+	inode->i_size = zi->i_wpoffset;
+	inode->i_blocks = zone->len;
+
+	inode->i_op = &zonefs_file_inode_operations;
+	inode->i_fop = &zonefs_file_operations;
+	inode->i_mapping->a_ops = &zonefs_file_aops;
+
+	sb->s_maxbytes = max(zi->i_max_size, sb->s_maxbytes);
+	sbi->s_blocks += zi->i_max_size >> sb->s_blocksize_bits;
+	sbi->s_used_blocks += zi->i_wpoffset >> sb->s_blocksize_bits;
+}
+
+static struct dentry *zonefs_create_inode(struct dentry *parent,
+					const char *name, struct blk_zone *zone)
+{
+	struct inode *dir = d_inode(parent);
+	struct dentry *dentry;
+	struct inode *inode;
+
+	dentry = d_alloc_name(parent, name);
+	if (!dentry)
+		return NULL;
+
+	inode = new_inode(parent->d_sb);
+	if (!inode)
+		goto out;
+
+	inode->i_ino = get_next_ino();
+	inode->i_ctime = inode->i_mtime = inode->i_atime = dir->i_ctime;
+	if (zone)
+		zonefs_init_file_inode(inode, zone);
+	else
+		zonefs_init_dir_inode(dir, inode);
+	d_add(dentry, inode);
+	dir->i_size++;
+
+	return dentry;
+
+out:
+	dput(dentry);
+
+	return NULL;
+}
+
+static char *zgroups_name[ZONEFS_ZTYPE_MAX] = { "cnv", "seq" };
+
+struct zonefs_zone_data {
+	struct super_block *sb;
+	unsigned int nr_zones[ZONEFS_ZTYPE_MAX];
+	struct blk_zone *zones;
+};
+
+/*
+ * Create a zone group and populate it with zone files.
+ */
+static int zonefs_create_zgroup(struct zonefs_zone_data *zd,
+				enum zonefs_ztype type)
+{
+	struct super_block *sb = zd->sb;
+	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
+	struct blk_zone *zone, *next, *end;
+	char name[ZONEFS_NAME_MAX];
+	struct dentry *dir;
+	unsigned int n = 0;
+
+	/* If the group is empty, there is nothing to do */
+	if (!zd->nr_zones[type])
+		return 0;
+
+	dir = zonefs_create_inode(sb->s_root, zgroups_name[type], NULL);
+	if (!dir)
+		return -ENOMEM;
+
+	/*
+	 * The first zone contains the super block: skip it.
+	 */
+	end = zd->zones + blkdev_nr_zones(sb->s_bdev->bd_disk);
+	for (zone = &zd->zones[1]; zone < end; zone = next) {
+
+		next = zone + 1;
+		if (zonefs_zone_type(zone) != type)
+			continue;
+
+		/*
+		 * For conventional zones, contiguous zones can be aggregated
+		 * together to form larger files.
+		 * Note that this overwrites the length of the first zone of
+		 * the set of contiguous zones aggregated together.
+		 * Only zones with the same condition can be aggregated so that
+		 * offline zones are excluded and readonly zones are aggregated
+		 * together into a read only file.
+		 */
+		if (type == ZONEFS_ZTYPE_CNV &&
+		    (sbi->s_features & ZONEFS_F_AGGRCNV)) {
+			for (; next < end; next++) {
+				if (zonefs_zone_type(next) != type ||
+				    next->cond != zone->cond)
+					break;
+				zone->len += next->len;
+			}
+		}
+
+		/*
+		 * Use the file number within its group as file name.
+		 */
+		snprintf(name, ZONEFS_NAME_MAX - 1, "%u", n);
+		if (!zonefs_create_inode(dir, name, zone))
+			return -ENOMEM;
+
+		n++;
+	}
+
+	zonefs_info(sb, "Zone group \"%s\" has %u file%s\n",
+		    zgroups_name[type], n, n > 1 ? "s" : "");
+
+	sbi->s_nr_files[type] = n;
+
+	return 0;
+}
+
+static int zonefs_get_zone_info_cb(struct blk_zone *zone, unsigned int idx,
+				   void *data)
+{
+	struct zonefs_zone_data *zd = data;
+
+	/*
+	 * Count the number of usable zones: the first zone at index 0 contains
+	 * the super block and is ignored.
+	 */
+	switch (zone->type) {
+	case BLK_ZONE_TYPE_CONVENTIONAL:
+		zone->wp = zone->start + zone->len;
+		if (idx)
+			zd->nr_zones[ZONEFS_ZTYPE_CNV]++;
+		break;
+	case BLK_ZONE_TYPE_SEQWRITE_REQ:
+	case BLK_ZONE_TYPE_SEQWRITE_PREF:
+		if (idx)
+			zd->nr_zones[ZONEFS_ZTYPE_SEQ]++;
+		break;
+	default:
+		zonefs_err(zd->sb, "Unsupported zone type 0x%x\n",
+			   zone->type);
+		return -EIO;
+	}
+
+	memcpy(&zd->zones[idx], zone, sizeof(struct blk_zone));
+
+	return 0;
+}
+
+static int zonefs_get_zone_info(struct zonefs_zone_data *zd)
+{
+	struct block_device *bdev = zd->sb->s_bdev;
+	int ret;
+
+	zd->zones = kvcalloc(blkdev_nr_zones(bdev->bd_disk),
+			     sizeof(struct blk_zone), GFP_KERNEL);
+	if (!zd->zones)
+		return -ENOMEM;
+
+	/* Get zones information */
+	ret = blkdev_report_zones(bdev, 0, BLK_ALL_ZONES,
+				  zonefs_get_zone_info_cb, zd);
+	if (ret < 0) {
+		zonefs_err(zd->sb, "Zone report failed %d\n", ret);
+		return ret;
+	}
+
+	if (ret != blkdev_nr_zones(bdev->bd_disk)) {
+		zonefs_err(zd->sb, "Invalid zone report (%d/%u zones)\n",
+			   ret, blkdev_nr_zones(bdev->bd_disk));
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static inline void zonefs_cleanup_zone_info(struct zonefs_zone_data *zd)
+{
+	kvfree(zd->zones);
+}
+
+/*
+ * Read super block information from the device.
+ */
+static int zonefs_read_super(struct super_block *sb)
+{
+	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
+	struct zonefs_super *super;
+	u32 crc, stored_crc;
+	struct page *page;
+	struct bio_vec bio_vec;
+	struct bio bio;
+	int ret;
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+
+	bio_init(&bio, &bio_vec, 1);
+	bio.bi_iter.bi_sector = 0;
+	bio_set_dev(&bio, sb->s_bdev);
+	bio_set_op_attrs(&bio, REQ_OP_READ, 0);
+	bio_add_page(&bio, page, PAGE_SIZE, 0);
+
+	ret = submit_bio_wait(&bio);
+	if (ret)
+		goto out;
+
+	super = page_address(page);
+
+	stored_crc = le32_to_cpu(super->s_crc);
+	super->s_crc = 0;
+	crc = crc32(~0U, (unsigned char *)super, sizeof(struct zonefs_super));
+	if (crc != stored_crc) {
+		zonefs_err(sb, "Invalid checksum (Expected 0x%08x, got 0x%08x)\n",
+			   crc, stored_crc);
+		ret = -EIO;
+		goto out;
+	}
+
+	ret = -EINVAL;
+	if (le32_to_cpu(super->s_magic) != ZONEFS_MAGIC)
+		goto out;
+
+	sbi->s_features = le64_to_cpu(super->s_features);
+	if (sbi->s_features & ~ZONEFS_F_DEFINED_FEATURES) {
+		zonefs_err(sb, "Unknown features set 0x%llx\n",
+			   sbi->s_features);
+		goto out;
+	}
+
+	if (sbi->s_features & ZONEFS_F_UID) {
+		sbi->s_uid = make_kuid(current_user_ns(),
+				       le32_to_cpu(super->s_uid));
+		if (!uid_valid(sbi->s_uid)) {
+			zonefs_err(sb, "Invalid UID feature\n");
+			goto out;
+		}
+	}
+
+	if (sbi->s_features & ZONEFS_F_GID) {
+		sbi->s_gid = make_kgid(current_user_ns(),
+				       le32_to_cpu(super->s_gid));
+		if (!gid_valid(sbi->s_gid)) {
+			zonefs_err(sb, "Invalid GID feature\n");
+			goto out;
+		}
+	}
+
+	if (sbi->s_features & ZONEFS_F_PERM)
+		sbi->s_perm = le32_to_cpu(super->s_perm);
+
+	if (memchr_inv(super->s_reserved, 0, sizeof(super->s_reserved))) {
+		zonefs_err(sb, "Reserved area is being used\n");
+		goto out;
+	}
+
+	uuid_copy(&sbi->s_uuid, (uuid_t *)super->s_uuid);
+	ret = 0;
+
+out:
+	__free_page(page);
+
+	return ret;
+}
+
+/*
+ * Check that the device is zoned. If it is, get the list of zones and create
+ * sub-directories and files according to the device zone configuration and
+ * format options.
+ */
+static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct zonefs_zone_data zd;
+	struct zonefs_sb_info *sbi;
+	struct inode *inode;
+	enum zonefs_ztype t;
+	int ret;
+
+	if (!bdev_is_zoned(sb->s_bdev)) {
+		zonefs_err(sb, "Not a zoned block device\n");
+		return -EINVAL;
+	}
+
+	/*
+	 * Initialize super block information: the maximum file size is updated
+	 * when the zone files are created so that the ZONEFS_F_AGGRCNV format
+	 * option, which can increase the maximum file size beyond the zone
+	 * size, is taken into account.
+	 */
+	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
+	if (!sbi)
+		return -ENOMEM;
+
+	spin_lock_init(&sbi->s_lock);
+	sb->s_fs_info = sbi;
+	sb->s_magic = ZONEFS_MAGIC;
+	sb->s_maxbytes = 0;
+	sb->s_op = &zonefs_sops;
+	sb->s_time_gran	= 1;
+
+	/*
+	 * The block size is always equal to the device physical sector size to
+	 * ensure that writes on 512e devices (512B logical block and 4KB
+	 * physical block) are always aligned to the device physical blocks
+	 * (as required for writes to sequential zones on ZBC/ZAC disks).
+	 */
+	sb_set_blocksize(sb, bdev_physical_block_size(sb->s_bdev));
+	sbi->s_blocksize_mask = sb->s_blocksize - 1;
+	sbi->s_uid = GLOBAL_ROOT_UID;
+	sbi->s_gid = GLOBAL_ROOT_GID;
+	sbi->s_perm = 0640;
+
+	ret = zonefs_read_super(sb);
+	if (ret)
+		return ret;
+
+	memset(&zd, 0, sizeof(struct zonefs_zone_data));
+	zd.sb = sb;
+	ret = zonefs_get_zone_info(&zd);
+	if (ret)
+		goto out;
+
+	zonefs_info(sb, "Mounting %u zones\n",
+		    blkdev_nr_zones(sb->s_bdev->bd_disk));
+
+	/* Create root directory inode */
+	ret = -ENOMEM;
+	inode = new_inode(sb);
+	if (!inode)
+		goto out;
+
+	inode->i_ino = get_next_ino();
+	inode->i_mode = S_IFDIR | 0555;
+	inode->i_ctime = inode->i_mtime = inode->i_atime = current_time(inode);
+	inode->i_op = &zonefs_dir_inode_operations;
+	inode->i_fop = &simple_dir_operations;
+	set_nlink(inode, 2);
+
+	sb->s_root = d_make_root(inode);
+	if (!sb->s_root)
+		goto out;
+
+	/* Create and populate files in zone groups directories */
+	for (t = 0; t < ZONEFS_ZTYPE_MAX; t++) {
+		ret = zonefs_create_zgroup(&zd, t);
+		if (ret)
+			break;
+	}
+
+out:
+	zonefs_cleanup_zone_info(&zd);
+
+	return ret;
+}
+
+static struct dentry *zonefs_mount(struct file_system_type *fs_type,
+				   int flags, const char *dev_name, void *data)
+{
+	return mount_bdev(fs_type, flags, dev_name, data, zonefs_fill_super);
+}
+
+static void zonefs_kill_super(struct super_block *sb)
+{
+	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
+
+	if (sb->s_root)
+		d_genocide(sb->s_root);
+	kill_block_super(sb);
+	kfree(sbi);
+}
+
+/*
+ * File system definition and registration.
+ */
+static struct file_system_type zonefs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "zonefs",
+	.mount		= zonefs_mount,
+	.kill_sb	= zonefs_kill_super,
+	.fs_flags	= FS_REQUIRES_DEV,
+};
+
+static int __init zonefs_init_inodecache(void)
+{
+	zonefs_inode_cachep = kmem_cache_create("zonefs_inode_cache",
+			sizeof(struct zonefs_inode_info), 0,
+			(SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD | SLAB_ACCOUNT),
+			NULL);
+	if (zonefs_inode_cachep == NULL)
+		return -ENOMEM;
+	return 0;
+}
+
+static void zonefs_destroy_inodecache(void)
+{
+	/*
+	 * Make sure all delayed rcu free inodes are flushed before we
+	 * destroy the inode cache.
+	 */
+	rcu_barrier();
+	kmem_cache_destroy(zonefs_inode_cachep);
+}
+
+static int __init zonefs_init(void)
+{
+	int ret;
+
+	BUILD_BUG_ON(sizeof(struct zonefs_super) != ZONEFS_SUPER_SIZE);
+
+	ret = zonefs_init_inodecache();
+	if (ret)
+		return ret;
+
+	ret = register_filesystem(&zonefs_type);
+	if (ret) {
+		zonefs_destroy_inodecache();
+		return ret;
+	}
+
+	return 0;
+}
+
+static void __exit zonefs_exit(void)
+{
+	unregister_filesystem(&zonefs_type);
+	zonefs_destroy_inodecache();
+}
+
+MODULE_AUTHOR("Damien Le Moal");
+MODULE_DESCRIPTION("Zone file system for zoned block devices");
+MODULE_LICENSE("GPL");
+module_init(zonefs_init);
+module_exit(zonefs_exit);
diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
new file mode 100644
index 000000000000..5862e17573ba
--- /dev/null
+++ b/fs/zonefs/zonefs.h
@@ -0,0 +1,175 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Simple zone file system for zoned block devices.
+ *
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ */
+#ifndef __ZONEFS_H__
+#define __ZONEFS_H__
+
+#include <linux/fs.h>
+#include <linux/magic.h>
+#include <linux/uuid.h>
+#include <linux/mutex.h>
+#include <linux/rwsem.h>
+
+/*
+ * Maximum length of file names: this only needs to be large enough to fit
+ * the zone group directory names and a decimal zone number for file names.
+ * 16 characters is plenty.
+ */
+#define ZONEFS_NAME_MAX		16
+
+/*
+ * Zone types: ZONEFS_ZTYPE_SEQ is used for all sequential zone types
+ * defined in linux/blkzoned.h, that is, BLK_ZONE_TYPE_SEQWRITE_REQ and
+ * BLK_ZONE_TYPE_SEQWRITE_PREF.
+ */
+enum zonefs_ztype {
+	ZONEFS_ZTYPE_CNV,
+	ZONEFS_ZTYPE_SEQ,
+	ZONEFS_ZTYPE_MAX,
+};
+
+static inline enum zonefs_ztype zonefs_zone_type(struct blk_zone *zone)
+{
+	if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
+		return ZONEFS_ZTYPE_CNV;
+	return ZONEFS_ZTYPE_SEQ;
+}
+
+/*
+ * In-memory inode data.
+ */
+struct zonefs_inode_info {
+	struct inode		i_vnode;
+
+	/* File zone type */
+	enum zonefs_ztype	i_ztype;
+
+	/* File zone start sector (512B unit) */
+	sector_t		i_zsector;
+
+	/* File zone write pointer position (sequential zones only) */
+	loff_t			i_wpoffset;
+
+	/* File maximum size */
+	loff_t			i_max_size;
+
+	/*
+	 * To serialize fully against both syscall and mmap based IO and
+	 * sequential file truncation, two locks are used. i_truncate_mutex
+	 * serializes zonefs_seq_file_truncate() against zonefs_iomap_begin(),
+	 * that is, file truncate operations against block mapping.
+	 * i_truncate_mutex also protects against concurrent accesses
+	 * and changes to the inode private data, and in particular changes to
+	 * a sequential file size on completion of direct IO writes.
+	 * Serialization of mmap read IOs with truncate and syscall IO
+	 * operations is done with i_mmap_sem in addition to i_truncate_mutex.
+	 * Only zonefs_seq_file_truncate() takes both locks (i_mmap_sem first,
+	 * i_truncate_mutex second).
+	 */
+	struct mutex		i_truncate_mutex;
+	struct rw_semaphore	i_mmap_sem;
+};
+
+static inline struct zonefs_inode_info *ZONEFS_I(struct inode *inode)
+{
+	return container_of(inode, struct zonefs_inode_info, i_vnode);
+}
+
+/*
+ * On-disk super block (block 0).
+ */
+#define ZONEFS_LABEL_LEN	64
+#define ZONEFS_UUID_SIZE	16
+#define ZONEFS_SUPER_SIZE	4096
+
+struct zonefs_super {
+
+	/* Magic number */
+	__le32		s_magic;
+
+	/* Checksum */
+	__le32		s_crc;
+
+	/* Volume label */
+	char		s_label[ZONEFS_LABEL_LEN];
+
+	/* 128-bit uuid */
+	__u8		s_uuid[ZONEFS_UUID_SIZE];
+
+	/* Features */
+	__le64		s_features;
+
+	/* UID/GID to use for files */
+	__le32		s_uid;
+	__le32		s_gid;
+
+	/* File permissions */
+	__le32		s_perm;
+
+	/* Padding to ZONEFS_SUPER_SIZE bytes */
+	__u8		s_reserved[3988];
+
+} __packed;
+
+/*
+ * Feature flags: used on disk in the s_features field of struct zonefs_super
+ * and in-memory in the s_features field of struct zonefs_sb_info.
+ */
+enum zonefs_features {
+	/*
+	 * Aggregate contiguous conventional zones into a single file.
+	 */
+	ZONEFS_F_AGGRCNV = 1ULL << 0,
+	/*
+	 * Use super block specified UID for files instead of default.
+	 */
+	ZONEFS_F_UID = 1ULL << 1,
+	/*
+	 * Use super block specified GID for files instead of default.
+	 */
+	ZONEFS_F_GID = 1ULL << 2,
+	/*
+	 * Use super block specified file permissions instead of default 640.
+	 */
+	ZONEFS_F_PERM = 1ULL << 3,
+};
+
+#define ZONEFS_F_DEFINED_FEATURES \
+	(ZONEFS_F_AGGRCNV | ZONEFS_F_UID | ZONEFS_F_GID | ZONEFS_F_PERM)
+
+/*
+ * In-memory super block information.
+ */
+struct zonefs_sb_info {
+
+	spinlock_t		s_lock;
+
+	unsigned long long	s_features;
+	kuid_t			s_uid;
+	kgid_t			s_gid;
+	umode_t			s_perm;
+	uuid_t			s_uuid;
+	loff_t			s_blocksize_mask;
+
+	unsigned int		s_nr_files[ZONEFS_ZTYPE_MAX];
+
+	loff_t			s_blocks;
+	loff_t			s_used_blocks;
+};
+
+static inline struct zonefs_sb_info *ZONEFS_SB(struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
+#define zonefs_info(sb, format, args...)	\
+	pr_info("zonefs (%s): " format, sb->s_id, ## args)
+#define zonefs_err(sb, format, args...)	\
+	pr_err("zonefs (%s) ERROR: " format, sb->s_id, ## args)
+#define zonefs_warn(sb, format, args...)	\
+	pr_warn("zonefs (%s) WARN: " format, sb->s_id, ## args)
+
+#endif
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 3ac436376d79..d78064007b17 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -87,6 +87,7 @@ 
 #define NSFS_MAGIC		0x6e736673
 #define BPF_FS_MAGIC		0xcafe4a11
 #define AAFS_MAGIC		0x5a3c69f0
+#define ZONEFS_MAGIC		0x5a4f4653
 
 /* Since UDF 2.01 is ISO 13346 based... */
 #define UDF_SUPER_MAGIC		0x15013346