diff mbox series

[RFC] nvme: set block size during namespace validation

Message ID 20201223150136.4221-1-minwoo.im.dev@gmail.com (mailing list archive)
State New, archived
Headers show
Series [RFC] nvme: set block size during namespace validation | expand

Commit Message

Minwoo Im Dec. 23, 2020, 3:01 p.m. UTC
Let's say we have 2 LBA format for 4096B and 512B LBA size for a
NVMe namespace.  Assume that current LBA format is 4096B and in case
we convert namespace to 512B and 4096B back again:

  nvme format /dev/nvme0n1 --lbaf=1 --force  # 512B
  nvme format /dev/nvme0n1 --lbaf=0 --force  # 4096B

  Then we can see the following errors during the BLKRRPART ioctl from
the nvme-cli format subcommand:

  [   10.771740] blk_update_request: operation not supported error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
  [   10.780262] Buffer I/O error on dev nvme0n1, logical block 0, async page read
  ...

  Also, we can see the Read commands followed by the Format command due
to BLKRRPART ioctl with Number of LBAs to 0xffff which is under-flowed
because the request for the Read commands are coming with 512B.

  kworker/0:1H-56      [000] ....   913.456922: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=1, cmdid=216, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=0, len=65535, ctrl=0x0, dsmgmt=0, reftag=0)
  ksoftirqd/0-9       [000] .Ns.   916.566351: nvme_complete_rq: nvme0: disk=nvme0n1, qid=1, cmdid=216, res=0x0, retries=0, flags=0x0, status=0x4002

  Before we have commit 5ff9f19231a0 ("block: simplify
set_init_blocksize"), block size used to be bumped up to the
4K(PAGE_SIZE) in this example and we have not seen these errors.  But
with this patch, we have to make sure that bdev->bd_inode->i_blkbits to
make sure that BLKRRPART ioctl pass proper request length based on the
changed logical block size.

  Currently, nvme_setup_rw() does not consider under-flow case when it
fills the cmnd->rw.length:

  cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);

  Maybe we can have some statement to prevent under-flow case here, but
this patch set the blocksize to the block layer when validating
namespace where the logical_block_size of request_queue also is set.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
---
 drivers/nvme/host/core.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Christoph Hellwig Dec. 23, 2020, 3:49 p.m. UTC | #1
set_blocksize just sets the block sise used for buffer heads and should
not be called by the driver.  blkdev_get updates the block size, so
you must already have the fd re-reading the partition table open?
I'm not entirely sure how we can work around this except by avoiding
buffer head I/O in the partition reread code.  Note that this affects
all block drivers where the block size could change at runtime.
Minwoo Im Dec. 23, 2020, 4:16 p.m. UTC | #2
Hello,

On 20-12-23 16:49:04, Christoph Hellwig wrote:
> set_blocksize just sets the block sise used for buffer heads and should
> not be called by the driver.  blkdev_get updates the block size, so
> you must already have the fd re-reading the partition table open?
> I'm not entirely sure how we can work around this except by avoiding
> buffer head I/O in the partition reread code.  Note that this affects
> all block drivers where the block size could change at runtime.

Thank you Christoph for your comment on this.

Agreed.  BLKRRPART leads us to block_read_full_page which takes buffer
heads for I/O.

Yes, __blkdev_get() sets i_blkbits of block device inode via
set_init_blocksize.  And Yes again as nvme-cli already opened the block
device fd and requests the BLKRRPART with that fd.  Also, __bdev_get()
only updates the i_blkbits(blocksize) in case bdev->bd_openers == 0 which
is the first time to open this block device.

Then, how about having NVMe driver prevent underflow case for the
request->__data_len is smaller than the logical block size like:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ce1b61519441..030353d203bf 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -803,7 +803,11 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
        cmnd->rw.opcode = op;
        cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
        cmnd->rw.slba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
-       cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
+
+       if (unlikely(blk_rq_bytes(req) < (1 << ns->lba_shift)))
+               cmnd->rw.length = 0;
+       else
+               cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
 
        if (req_op(req) == REQ_OP_WRITE && ctrl->nr_streams)
                nvme_assign_write_stream(ctrl, req, &control, &dsmgmt);

Thanks,
Christoph Hellwig Dec. 23, 2020, 4:27 p.m. UTC | #3
On Thu, Dec 24, 2020 at 01:16:50AM +0900, Minwoo Im wrote:
> Hello,
> 
> On 20-12-23 16:49:04, Christoph Hellwig wrote:
> > set_blocksize just sets the block sise used for buffer heads and should
> > not be called by the driver.  blkdev_get updates the block size, so
> > you must already have the fd re-reading the partition table open?
> > I'm not entirely sure how we can work around this except by avoiding
> > buffer head I/O in the partition reread code.  Note that this affects
> > all block drivers where the block size could change at runtime.
> 
> Thank you Christoph for your comment on this.
> 
> Agreed.  BLKRRPART leads us to block_read_full_page which takes buffer
> heads for I/O.
> 
> Yes, __blkdev_get() sets i_blkbits of block device inode via
> set_init_blocksize.  And Yes again as nvme-cli already opened the block
> device fd and requests the BLKRRPART with that fd.  Also, __bdev_get()
> only updates the i_blkbits(blocksize) in case bdev->bd_openers == 0 which
> is the first time to open this block device.
> 
> Then, how about having NVMe driver prevent underflow case for the
> request->__data_len is smaller than the logical block size like:

Not sure this helps.  I think we need to fix this proper and in the
block layer.  The long term fix is to stop messing with i_blksize
at all, but that is going to take very long.

I think for now the only thing we can do is to set a flag in the
gendisk when the block size changes and then reject all I/O until
the next first open that sets the blocksize.
Minwoo Im Dec. 23, 2020, 6:31 p.m. UTC | #4
On 20-12-23 17:27:37, Christoph Hellwig wrote:
> On Thu, Dec 24, 2020 at 01:16:50AM +0900, Minwoo Im wrote:
> > Hello,
> > 
> > On 20-12-23 16:49:04, Christoph Hellwig wrote:
> > > set_blocksize just sets the block sise used for buffer heads and should
> > > not be called by the driver.  blkdev_get updates the block size, so
> > > you must already have the fd re-reading the partition table open?
> > > I'm not entirely sure how we can work around this except by avoiding
> > > buffer head I/O in the partition reread code.  Note that this affects
> > > all block drivers where the block size could change at runtime.
> > 
> > Thank you Christoph for your comment on this.
> > 
> > Agreed.  BLKRRPART leads us to block_read_full_page which takes buffer
> > heads for I/O.
> > 
> > Yes, __blkdev_get() sets i_blkbits of block device inode via
> > set_init_blocksize.  And Yes again as nvme-cli already opened the block
> > device fd and requests the BLKRRPART with that fd.  Also, __bdev_get()
> > only updates the i_blkbits(blocksize) in case bdev->bd_openers == 0 which
> > is the first time to open this block device.
> > 
> > Then, how about having NVMe driver prevent underflow case for the
> > request->__data_len is smaller than the logical block size like:
> 
> Not sure this helps.  I think we need to fix this proper and in the
> block layer.  The long term fix is to stop messing with i_blksize
> at all, but that is going to take very long.

Agreed.

> 
> I think for now the only thing we can do is to set a flag in the
> gendisk when the block size changes and then reject all I/O until
> the next first open that sets the blocksize.

Let me prepare work around patch for this issue soon.

Thanks!
diff mbox series

Patch

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ce1b61519441..6f2e8eb78e00 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2085,6 +2085,8 @@  static void nvme_update_disk_info(struct gendisk *disk,
 	}
 
 	blk_queue_logical_block_size(disk->queue, bs);
+	set_blocksize(disk->part0, bs);
+
 	/*
 	 * Linux filesystems assume writing a single physical block is
 	 * an atomic operation. Hence limit the physical block size to the