diff mbox series

[RFC,V2] block: reject I/O for same fd if block size changed

Message ID 20201230160129.23731-1-minwoo.im.dev@gmail.com (mailing list archive)
State New, archived
Headers show
Series [RFC,V2] block: reject I/O for same fd if block size changed | expand

Commit Message

Minwoo Im Dec. 30, 2020, 4:01 p.m. UTC
Let's say, for example of NVMe device, Format command to change out
LBA format to another logical block size and BLKRRPART to re-read
partition information with a same file descriptor like:

	fd = open("/dev/nvme0n1", O_RDONLY);

	nvme_format(fd, ...);
	if (ioctl(fd, BLKRRPART) < 0)
		..

In this case, ioctl causes invalid Read operations which are triggered
by buffer_head I/O path to re-read partition information.  This is
because it's still playing around with i_blksize and i_blkbits.  So,
512 -> 4096 -> 512 logical block size changes will cause an under-flowed
length of Read operations.

Case for NVMe:
  (LBAF 1 512B, LBAF 0 4K logical block size)

  nvme format /dev/nvme0n1 --lbaf=1 --force  # to 512B LBA
  nvme format /dev/nvme0n1 --lbaf=0 --force  # to 4096B LBA

[dmesg-snip]
  [   10.771740] blk_update_request: operation not supported error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
  [   10.780262] Buffer I/O error on dev nvme0n1, logical block 0, async page read

[event-snip]
  kworker/0:1H-56      [000] ....   913.456922: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=1, cmdid=216, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=0, len=65535, ctrl=0x0, dsmgmt=0, reftag=0)
   ksoftirqd/0-9       [000] .Ns.   916.566351: nvme_complete_rq: nvme0: disk=nvme0n1, qid=1, cmdid=216, res=0x0, retries=0, flags=0x0, status=0x4002

As the previous discussion [1], this patch introduced a gendisk flag
to indicate that block size has been changed in the runtime.  This flag
is set when logical block size is changed in the runtime in the block
layer.  It will be cleared when the file descriptor for the
block devie is opened again through __blkdev_get() which updates the block
size via set_init_blocksize().

This patch rejects I/O from the path of add_partitions() to avoid
issuing invalid Read operations to device.  It also sets a flag to
gendisk in blk_queue_logical_block_size to minimize caller-side updates.

[1] https://lore.kernel.org/linux-nvme/20201223183143.GB13354@localhost.localdomain/T/#t

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
---
 block/blk-settings.c    |  7 +++++++
 block/genhd.c           |  1 +
 block/partitions/core.c | 11 +++++++++++
 fs/block_dev.c          |  6 ++++++
 include/linux/genhd.h   |  1 +
 5 files changed, 26 insertions(+)

Comments

Chaitanya Kulkarni Jan. 4, 2021, 12:57 a.m. UTC | #1
On 12/30/20 08:03, Minwoo Im wrote:
> Let's say, for example of NVMe device, Format command to change out
> LBA format to another logical block size and BLKRRPART to re-read
> partition information with a same file descriptor like:
>
> 	fd = open("/dev/nvme0n1", O_RDONLY);
>
> 	nvme_format(fd, ...);
> 	if (ioctl(fd, BLKRRPART) < 0)
> 		..
>
> In this case, ioctl causes invalid Read operations which are triggered
> by buffer_head I/O path to re-read partition information.  This is
> because it's still playing around with i_blksize and i_blkbits.  So,
> 512 -> 4096 -> 512 logical block size changes will cause an under-flowed
> length of Read operations.
>
> Case for NVMe:
>   (LBAF 1 512B, LBAF 0 4K logical block size)
>
>   nvme format /dev/nvme0n1 --lbaf=1 --force  # to 512B LBA
>   nvme format /dev/nvme0n1 --lbaf=0 --force  # to 4096B LBA
>
> [dmesg-snip]
>   [   10.771740] blk_update_request: operation not supported error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
>   [   10.780262] Buffer I/O error on dev nvme0n1, logical block 0, async page read
>
> [event-snip]
>   kworker/0:1H-56      [000] ....   913.456922: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=1, cmdid=216, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=0, len=65535, ctrl=0x0, dsmgmt=0, reftag=0)
>    ksoftirqd/0-9       [000] .Ns.   916.566351: nvme_complete_rq: nvme0: disk=nvme0n1, qid=1, cmdid=216, res=0x0, retries=0, flags=0x0, status=0x4002
>
> As the previous discussion [1], this patch introduced a gendisk flag
> to indicate that block size has been changed in the runtime.  This flag
> is set when logical block size is changed in the runtime in the block
> layer.  It will be cleared when the file descriptor for the
> block devie is opened again through __blkdev_get() which updates the block
> size via set_init_blocksize().
>
> This patch rejects I/O from the path of add_partitions() to avoid
> issuing invalid Read operations to device.  It also sets a flag to
> gendisk in blk_queue_logical_block_size to minimize caller-side updates.
>
> [1] https://lore.kernel.org/linux-nvme/20201223183143.GB13354@localhost.localdomain/T/#t
>
> Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Rewrite the change-log similar to what we have in the repo and fix
the spelling mistakes. Add a cover-letter to explain the testcase
and the execution effect, also I'd move discussion link into
cover-letter.
Chaitanya Kulkarni Jan. 4, 2021, 1:02 a.m. UTC | #2
On 12/30/20 08:03, Minwoo Im wrote:
>  
> diff --git a/block/genhd.c b/block/genhd.c
> index 73faec438e49..c3a73cba7c88 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -81,6 +81,7 @@ bool set_capacity_and_notify(struct gendisk *disk, sector_t size)
>  	 */
>  	if (!capacity || !size)
>  		return false;
> +
>  	kobject_uevent_env(&disk_to_dev(disk)->kobj, KOBJ_CHANGE, envp);
>  	return true;
>  }
Avoid adding extra newlines in the patch, remove above extra line in v3.
Minwoo Im Jan. 4, 2021, 1:28 a.m. UTC | #3
On 21-01-04 00:57:07, Chaitanya Kulkarni wrote:
> On 12/30/20 08:03, Minwoo Im wrote:
> > Let's say, for example of NVMe device, Format command to change out
> > LBA format to another logical block size and BLKRRPART to re-read
> > partition information with a same file descriptor like:
> >
> > 	fd = open("/dev/nvme0n1", O_RDONLY);
> >
> > 	nvme_format(fd, ...);
> > 	if (ioctl(fd, BLKRRPART) < 0)
> > 		..
> >
> > In this case, ioctl causes invalid Read operations which are triggered
> > by buffer_head I/O path to re-read partition information.  This is
> > because it's still playing around with i_blksize and i_blkbits.  So,
> > 512 -> 4096 -> 512 logical block size changes will cause an under-flowed
> > length of Read operations.
> >
> > Case for NVMe:
> >   (LBAF 1 512B, LBAF 0 4K logical block size)
> >
> >   nvme format /dev/nvme0n1 --lbaf=1 --force  # to 512B LBA
> >   nvme format /dev/nvme0n1 --lbaf=0 --force  # to 4096B LBA
> >
> > [dmesg-snip]
> >   [   10.771740] blk_update_request: operation not supported error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
> >   [   10.780262] Buffer I/O error on dev nvme0n1, logical block 0, async page read
> >
> > [event-snip]
> >   kworker/0:1H-56      [000] ....   913.456922: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=1, cmdid=216, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=0, len=65535, ctrl=0x0, dsmgmt=0, reftag=0)
> >    ksoftirqd/0-9       [000] .Ns.   916.566351: nvme_complete_rq: nvme0: disk=nvme0n1, qid=1, cmdid=216, res=0x0, retries=0, flags=0x0, status=0x4002
> >
> > As the previous discussion [1], this patch introduced a gendisk flag
> > to indicate that block size has been changed in the runtime.  This flag
> > is set when logical block size is changed in the runtime in the block
> > layer.  It will be cleared when the file descriptor for the
> > block devie is opened again through __blkdev_get() which updates the block
> > size via set_init_blocksize().
> >
> > This patch rejects I/O from the path of add_partitions() to avoid
> > issuing invalid Read operations to device.  It also sets a flag to
> > gendisk in blk_queue_logical_block_size to minimize caller-side updates.
> >
> > [1] https://lore.kernel.org/linux-nvme/20201223183143.GB13354@localhost.localdomain/T/#t
> >
> > Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
> Rewrite the change-log similar to what we have in the repo and fix
> the spelling mistakes. Add a cover-letter to explain the testcase
> and the execution effect, also I'd move discussion link into
> cover-letter.

Thanks for your time.  Will prepare V3 with proper change logs and cover
letter to explain issue and testcase.

Thanks,
diff mbox series

Patch

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 43990b1d148b..205822ffcaef 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -328,6 +328,13 @@  EXPORT_SYMBOL(blk_queue_max_segment_size);
 void blk_queue_logical_block_size(struct request_queue *q, unsigned int size)
 {
 	struct queue_limits *limits = &q->limits;
+	struct block_device *bdev;
+
+	if (q->backing_dev_info && q->backing_dev_info->owner &&
+			limits->logical_block_size != size) {
+		bdev = blkdev_get_no_open(q->backing_dev_info->owner->devt);
+		bdev->bd_disk->flags |= GENHD_FL_BLOCK_SIZE_CHANGED;
+	}
 
 	limits->logical_block_size = size;
 
diff --git a/block/genhd.c b/block/genhd.c
index 73faec438e49..c3a73cba7c88 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -81,6 +81,7 @@  bool set_capacity_and_notify(struct gendisk *disk, sector_t size)
 	 */
 	if (!capacity || !size)
 		return false;
+
 	kobject_uevent_env(&disk_to_dev(disk)->kobj, KOBJ_CHANGE, envp);
 	return true;
 }
diff --git a/block/partitions/core.c b/block/partitions/core.c
index e7d776db803b..5a0330c1b6f9 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -618,6 +618,17 @@  int blk_add_partitions(struct gendisk *disk, struct block_device *bdev)
 	if (!disk_part_scan_enabled(disk))
 		return 0;
 
+	/*
+	 * Reject to check partition information if block size has been changed
+	 * in the runtime.  If block size of a block device has been changed,
+	 * the file descriptor should be opened agian to update the blkbits.
+	 */
+	if (disk->flags & GENHD_FL_BLOCK_SIZE_CHANGED) {
+		pr_warn("%s: rejecting checking partition. fd should be opened again.\n",
+				disk->disk_name);
+		return -EBADFD;
+	}
+
 	state = check_partition(disk, bdev);
 	if (!state)
 		return 0;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9293045e128c..c996de3d6084 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -131,6 +131,12 @@  EXPORT_SYMBOL(truncate_bdev_range);
 static void set_init_blocksize(struct block_device *bdev)
 {
 	bdev->bd_inode->i_blkbits = blksize_bits(bdev_logical_block_size(bdev));
+
+	/*
+	 * Allow I/O commands for this block device.  We can say that this
+	 * block device has been set to a proper block size.
+	 */
+	bdev->bd_disk->flags &= ~GENHD_FL_BLOCK_SIZE_CHANGED;
 }
 
 int set_blocksize(struct block_device *bdev, int size)
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 809aaa32d53c..0e0e24917003 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -103,6 +103,7 @@  struct partition_meta_info {
 #define GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE	0x0100
 #define GENHD_FL_NO_PART_SCAN			0x0200
 #define GENHD_FL_HIDDEN				0x0400
+#define GENHD_FL_BLOCK_SIZE_CHANGED		0x0800
 
 enum {
 	DISK_EVENT_MEDIA_CHANGE			= 1 << 0, /* media changed */