Message ID | 20201223150136.4221-1-minwoo.im.dev@gmail.com (mailing list archive)
---|---
State | New, archived
Series | [RFC] nvme: set block size during namespace validation
set_blocksize just sets the block size used for buffer heads and should
not be called by the driver.  blkdev_get updates the block size, so
you must already have the fd re-reading the partition table open?
I'm not entirely sure how we can work around this except by avoiding
buffer head I/O in the partition reread code.  Note that this affects
all block drivers where the block size could change at runtime.
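For context, the first-open path being discussed boils down to something like the sketch below: after commit 5ff9f19231a0 ("block: simplify set_init_blocksize"), __blkdev_get() initialises i_blkbits from the queue's logical block size, and it only does so on the first open. This is a simplified reconstruction of the v5.10-era fs/block_dev.c helper, not a verbatim copy:

/*
 * Simplified reconstruction of the v5.10-era helper in fs/block_dev.c
 * (after commit 5ff9f19231a0, "block: simplify set_init_blocksize");
 * not a verbatim copy.  __blkdev_get() calls this only when
 * bdev->bd_openers == 0, so an fd that is already open keeps its old
 * i_blkbits even after a Format command changes the logical block size.
 */
static void set_init_blocksize(struct block_device *bdev)
{
	bdev->bd_inode->i_blkbits =
		blksize_bits(bdev_logical_block_size(bdev));
}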
Hello,

On 20-12-23 16:49:04, Christoph Hellwig wrote:
> set_blocksize just sets the block size used for buffer heads and should
> not be called by the driver.  blkdev_get updates the block size, so
> you must already have the fd re-reading the partition table open?
> I'm not entirely sure how we can work around this except by avoiding
> buffer head I/O in the partition reread code.  Note that this affects
> all block drivers where the block size could change at runtime.

Thank you Christoph for your comment on this.

Agreed.  BLKRRPART leads us to block_read_full_page(), which uses buffer
heads for the I/O.

Yes, __blkdev_get() sets i_blkbits of the block device inode via
set_init_blocksize().  And yes again: nvme-cli has already opened the
block device fd and requests BLKRRPART with that fd.  Also,
__blkdev_get() only updates i_blkbits (the block size) when
bdev->bd_openers == 0, i.e. the first time this block device is opened.

Then, how about having the NVMe driver prevent the underflow case where
request->__data_len is smaller than the logical block size, like:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ce1b61519441..030353d203bf 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -803,7 +803,11 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 	cmnd->rw.opcode = op;
 	cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
 	cmnd->rw.slba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
-	cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
+
+	if (unlikely(blk_rq_bytes(req) < (1 << ns->lba_shift)))
+		cmnd->rw.length = 0;
+	else
+		cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
 
 	if (req_op(req) == REQ_OP_WRITE && ctrl->nr_streams)
 		nvme_assign_write_stream(ctrl, req, &control, &dsmgmt);

Thanks,
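To make the underflow the hunk above guards against concrete, here is a small stand-alone illustration; the values are hypothetical and simply mirror the 512B-read-on-4096B-LBA case described later in the patch description:

/*
 * Stand-alone illustration of the underflow guarded against above.
 * The values are hypothetical: a 512-byte buffer-head read issued
 * against a namespace whose logical block size is now 4096 bytes.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	unsigned int lba_shift = 12;	/* 4096B logical blocks after the format */
	unsigned int rq_bytes  = 512;	/* request still sized in 512B units */

	/* (512 >> 12) - 1 == 0 - 1, truncated to 16 bits -> 0xffff */
	uint16_t nlb = (rq_bytes >> lba_shift) - 1;

	printf("0-based NLB field: %u (0x%x)\n", nlb, nlb);	/* 65535 (0xffff) */
	return 0;
}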
On Thu, Dec 24, 2020 at 01:16:50AM +0900, Minwoo Im wrote:
> Hello,
>
> On 20-12-23 16:49:04, Christoph Hellwig wrote:
> > set_blocksize just sets the block size used for buffer heads and should
> > not be called by the driver.  blkdev_get updates the block size, so
> > you must already have the fd re-reading the partition table open?
> > I'm not entirely sure how we can work around this except by avoiding
> > buffer head I/O in the partition reread code.  Note that this affects
> > all block drivers where the block size could change at runtime.
>
> Thank you Christoph for your comment on this.
>
> Agreed.  BLKRRPART leads us to block_read_full_page(), which uses buffer
> heads for the I/O.
>
> Yes, __blkdev_get() sets i_blkbits of the block device inode via
> set_init_blocksize().  And yes again: nvme-cli has already opened the
> block device fd and requests BLKRRPART with that fd.  Also,
> __blkdev_get() only updates i_blkbits (the block size) when
> bdev->bd_openers == 0, i.e. the first time this block device is opened.
>
> Then, how about having the NVMe driver prevent the underflow case where
> request->__data_len is smaller than the logical block size, like:

Not sure this helps.  I think we need to fix this properly, and in the
block layer.  The long-term fix is to stop messing with i_blksize at
all, but that is going to take very long.

I think for now the only thing we can do is to set a flag in the
gendisk when the block size changes and then reject all I/O until
the next first open that sets the blocksize.
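As a very rough illustration only of the direction sketched here (not an actual patch, and the flag name GD_BLKSIZE_STALE is made up; gendisk does carry a 'state' bitmap in this era, e.g. GD_NEED_PART_SCAN), the idea is roughly:

/*
 * Rough illustration of the idea above, NOT an actual patch.
 * GD_BLKSIZE_STALE is a hypothetical flag bit, and the exact hook
 * points may differ in whatever patch eventually gets written.
 */

/* driver side: the logical block size of the namespace changed */
set_bit(GD_BLKSIZE_STALE, &disk->state);

/* block layer submission path: refuse I/O while the size is stale */
if (test_bit(GD_BLKSIZE_STALE, &disk->state))
	return BLK_STS_IOERR;

/* next first open (bd_openers == 0): i_blkbits is re-initialised */
set_init_blocksize(bdev);
clear_bit(GD_BLKSIZE_STALE, &disk->state);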
On 20-12-23 17:27:37, Christoph Hellwig wrote:
> On Thu, Dec 24, 2020 at 01:16:50AM +0900, Minwoo Im wrote:
> > Hello,
> >
> > On 20-12-23 16:49:04, Christoph Hellwig wrote:
> > > set_blocksize just sets the block size used for buffer heads and should
> > > not be called by the driver.  blkdev_get updates the block size, so
> > > you must already have the fd re-reading the partition table open?
> > > I'm not entirely sure how we can work around this except by avoiding
> > > buffer head I/O in the partition reread code.  Note that this affects
> > > all block drivers where the block size could change at runtime.
> >
> > Thank you Christoph for your comment on this.
> >
> > Agreed.  BLKRRPART leads us to block_read_full_page(), which uses buffer
> > heads for the I/O.
> >
> > Yes, __blkdev_get() sets i_blkbits of the block device inode via
> > set_init_blocksize().  And yes again: nvme-cli has already opened the
> > block device fd and requests BLKRRPART with that fd.  Also,
> > __blkdev_get() only updates i_blkbits (the block size) when
> > bdev->bd_openers == 0, i.e. the first time this block device is opened.
> >
> > Then, how about having the NVMe driver prevent the underflow case where
> > request->__data_len is smaller than the logical block size, like:
>
> Not sure this helps.  I think we need to fix this properly, and in the
> block layer.  The long-term fix is to stop messing with i_blksize at
> all, but that is going to take very long.

Agreed.

> I think for now the only thing we can do is to set a flag in the
> gendisk when the block size changes and then reject all I/O until
> the next first open that sets the blocksize.

Let me prepare a work-around patch for this issue soon.

Thanks!
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ce1b61519441..6f2e8eb78e00 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2085,6 +2085,8 @@ static void nvme_update_disk_info(struct gendisk *disk,
 	}
 
 	blk_queue_logical_block_size(disk->queue, bs);
+	set_blocksize(disk->part0, bs);
+
 	/*
 	 * Linux filesystems assume writing a single physical block is
 	 * an atomic operation. Hence limit the physical block size to the
Let's say we have two LBA formats for an NVMe namespace, with 4096B and
512B LBA sizes.  Assume the current LBA format is 4096B, and we convert
the namespace to 512B and then back to 4096B again:

  nvme format /dev/nvme0n1 --lbaf=1 --force    # 512B
  nvme format /dev/nvme0n1 --lbaf=0 --force    # 4096B

Then we can see the following errors during the BLKRRPART ioctl issued
by the nvme-cli format subcommand:

  [ 10.771740] blk_update_request: operation not supported error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
  [ 10.780262] Buffer I/O error on dev nvme0n1, logical block 0, async page read
  ...

Also, we can see Read commands following the Format command (issued for
the BLKRRPART ioctl) with the Number of LBAs set to 0xffff, which has
underflowed because the requests for those Read commands still come in
with a 512B size:

  kworker/0:1H-56 [000] .... 913.456922: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=1, cmdid=216, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=0, len=65535, ctrl=0x0, dsmgmt=0, reftag=0)
  ksoftirqd/0-9 [000] .Ns. 916.566351: nvme_complete_rq: nvme0: disk=nvme0n1, qid=1, cmdid=216, res=0x0, retries=0, flags=0x0, status=0x4002

Before commit 5ff9f19231a0 ("block: simplify set_init_blocksize"), the
block size used to be bumped up to 4K (PAGE_SIZE) in this example, so we
did not see these errors.  With that commit, we have to make sure that
bdev->bd_inode->i_blkbits is updated so that the BLKRRPART ioctl issues
requests with a proper length based on the changed logical block size.

Currently, nvme_setup_rw() does not consider the underflow case when it
fills in cmnd->rw.length:

  cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);

Maybe we could add a check to prevent the underflow there, but instead
this patch sets the block size in the block layer when validating the
namespace, which is also where the logical_block_size of the
request_queue is set.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
---
 drivers/nvme/host/core.c | 2 ++
 1 file changed, 2 insertions(+)
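For readers tracing the set_blocksize(disk->part0, bs) call this patch adds: set_blocksize() in the v5.10-era fs/block_dev.c roughly does the following (a simplified reconstruction, not a verbatim copy), i.e. it validates the requested size and then updates bd_inode->i_blkbits, dropping any cached buffers sized for the old block size:

/*
 * Simplified reconstruction of set_blocksize() from the v5.10-era
 * fs/block_dev.c; not a verbatim copy.
 */
int set_blocksize(struct block_device *bdev, int size)
{
	/* must be a power of two between 512 and PAGE_SIZE */
	if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
		return -EINVAL;

	/* cannot be smaller than the device's logical block size */
	if (size < bdev_logical_block_size(bdev))
		return -EINVAL;

	/* only act if the block size actually changes */
	if (bdev->bd_inode->i_blkbits != blksize_bits(size)) {
		sync_blockdev(bdev);
		bdev->bd_inode->i_blkbits = blksize_bits(size);
		kill_bdev(bdev);	/* drop buffers sized for the old block size */
	}
	return 0;
}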