diff mbox series

block: fix sanity checks in blk_rq_map_user_bvec

Message ID 20241023211519.4177873-1-ushankar@purestorage.com (mailing list archive)
State New
Headers show
Series block: fix sanity checks in blk_rq_map_user_bvec | expand

Commit Message

Uday Shankar Oct. 23, 2024, 9:15 p.m. UTC
From: Xinyu Zhang <xizhang@purestorage.com>

blk_rq_map_user_bvec contains a check bytes + bv->bv_len > nr_iter which
causes unnecessary failures in NVMe passthrough I/O, reproducible as
follows:

- register a 2 page, page-aligned buffer against a ring
- use that buffer to do a 1 page io_uring NVMe passthrough read

The second (i = 1) iteration of the loop in blk_rq_map_user_bvec will
then have nr_iter == 1 page, bytes == 1 page, bv->bv_len == 1 page, so
the check bytes + bv->bv_len > nr_iter will succeed, causing the I/O to
fail. This failure is unnecessary, as when the check succeeds, it means
we've checked the entire buffer that will be used by the request - i.e.
blk_rq_map_user_bvec should complete successfully. Therefore, terminate
the loop early and return successfully when the check bytes + bv->bv_len
> nr_iter succeeds.

While we're at it, also remove the check that all segments in the bvec
are single-page. While this seems to be true for all users of the
function, it doesn't appear to be required anywhere downstream.

Signed-off-by: Xinyu Zhang <xizhang@purestorage.com>
Co-developed-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
---
 block/blk-map.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)


base-commit: d165768847839f8d1ae5f8081ecc018a190d50e8

Comments

Jens Axboe Oct. 23, 2024, 10:31 p.m. UTC | #1
On 10/23/24 3:15 PM, Uday Shankar wrote:
> From: Xinyu Zhang <xizhang@purestorage.com>
> 
> blk_rq_map_user_bvec contains a check bytes + bv->bv_len > nr_iter which
> causes unnecessary failures in NVMe passthrough I/O, reproducible as
> follows:
> 
> - register a 2 page, page-aligned buffer against a ring
> - use that buffer to do a 1 page io_uring NVMe passthrough read
> 
> The second (i = 1) iteration of the loop in blk_rq_map_user_bvec will
> then have nr_iter == 1 page, bytes == 1 page, bv->bv_len == 1 page, so
> the check bytes + bv->bv_len > nr_iter will succeed, causing the I/O to
> fail. This failure is unnecessary, as when the check succeeds, it means
> we've checked the entire buffer that will be used by the request - i.e.
> blk_rq_map_user_bvec should complete successfully. Therefore, terminate
> the loop early and return successfully when the check bytes + bv->bv_len
>> nr_iter succeeds.
> 
> While we're at it, also remove the check that all segments in the bvec
> are single-page. While this seems to be true for all users of the
> function, it doesn't appear to be required anywhere downstream.

Yep that looks like an issue. Since this would be nice to backport,
please add a Fixes tag as well.
Uday Shankar Oct. 23, 2024, 10:46 p.m. UTC | #2
On Wed, Oct 23, 2024 at 04:31:53PM -0600, Jens Axboe wrote:
> Yep that looks like an issue. Since this would be nice to backport,
> please add a Fixes tag as well.

Fixes: 37987547932c ("block: extend functionality to map bvec iterator")
Uday Shankar Oct. 23, 2024, 10:50 p.m. UTC | #3
On Wed, Oct 23, 2024 at 03:15:19PM -0600, Uday Shankar wrote:
> From: Xinyu Zhang <xizhang@purestorage.com>
> 
> blk_rq_map_user_bvec contains a check bytes + bv->bv_len > nr_iter which
> causes unnecessary failures in NVMe passthrough I/O, reproducible as
> follows:
> 
> - register a 2 page, page-aligned buffer against a ring
> - use that buffer to do a 1 page io_uring NVMe passthrough read
> 
> The second (i = 1) iteration of the loop in blk_rq_map_user_bvec will
> then have nr_iter == 1 page, bytes == 1 page, bv->bv_len == 1 page, so
> the check bytes + bv->bv_len > nr_iter will succeed, causing the I/O to
> fail. This failure is unnecessary, as when the check succeeds, it means
> we've checked the entire buffer that will be used by the request - i.e.
> blk_rq_map_user_bvec should complete successfully. Therefore, terminate
> the loop early and return successfully when the check bytes + bv->bv_len
> > nr_iter succeeds.

For anyone interested, here are the details on how to reproduce the
issue described above:

# cat test.c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <liburing.h>
#include <stdlib.h>
#include <assert.h>
#include <linux/nvme_ioctl.h>

int main(int argc, char *argv[]) {
    struct io_uring ring;

    assert(io_uring_queue_init(1, &ring, IORING_SETUP_SQE128 | IORING_SETUP_CQE32) == 0);

    void *buf = memalign(4096, 2 * 4096);
    printf("buf %p\n", buf);
    struct iovec iov = {
        .iov_base = buf,
        .iov_len = 2 * 4096,
    };

    assert(io_uring_register_buffers(&ring, &iov, 1) == 0);

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    assert(sqe != NULL);

    int fd = open("/dev/ng0n1", O_RDONLY);
    assert(fd > 0);
    sqe->fd = fd;
    sqe->opcode = IORING_OP_URING_CMD;
    sqe->cmd_op = NVME_URING_CMD_IO;
    sqe->buf_index = 0;
    sqe->flags = 0;
    sqe->uring_cmd_flags = IORING_URING_CMD_FIXED;

    struct nvme_passthru_cmd *cmd = &sqe->cmd;
    cmd->opcode = 2; // read
    cmd->nsid = 1;
    cmd->data_len = 1 * 4096;
    cmd->addr = buf;

    struct io_uring_cqe *cqe;
    assert(io_uring_submit(&ring) == 1);
    assert(io_uring_wait_cqe(&ring, &cqe) == 0);

    printf("res %d\n", cqe->res);

    return 0;
}
# gcc -o test -luring test.c
test.c: In function ‘main’:
test.c:15:17: warning: implicit declaration of function ‘memalign’ [-Wimplicit-function-declaration]
   15 |     void *buf = memalign(4096, 2 * 4096);
      |                 ^~~~~~~~
test.c:15:17: warning: initialization of ‘void *’ from ‘int’ makes pointer from integer without a cast [-Wint-conversion]
test.c:36:37: warning: initialization of ‘struct nvme_passthru_cmd *’ from incompatible pointer type ‘__u8 (*)[0]’ {aka ‘unsigned char (*)[]’} [-Wincompatible-pointer-types]
   36 |     struct nvme_passthru_cmd *cmd = &sqe->cmd;
      |                                     ^
test.c:40:15: warning: assignment to ‘__u64’ {aka ‘long long unsigned int’} from ‘void *’ makes integer from pointer without a cast [-Wint-conversion]
   40 |     cmd->addr = buf;
      |
# ./test
buf 0x406000
res -22
Bart Van Assche Oct. 23, 2024, 10:54 p.m. UTC | #4
On 10/23/24 3:50 PM, Uday Shankar wrote:
> For anyone interested, here are the details on how to reproduce the
> issue described above:
> 
> # cat test.c
> #include <fcntl.h>
> #include <stdio.h>
> #include <string.h>
> #include <sys/ioctl.h>
> #include <liburing.h>
> #include <stdlib.h>
> #include <assert.h>
> #include <linux/nvme_ioctl.h>
> 
> int main(int argc, char *argv[]) {
>      struct io_uring ring;
> 
>      assert(io_uring_queue_init(1, &ring, IORING_SETUP_SQE128 | IORING_SETUP_CQE32) == 0);
> 
>      void *buf = memalign(4096, 2 * 4096);
>      printf("buf %p\n", buf);
>      struct iovec iov = {
>          .iov_base = buf,
>          .iov_len = 2 * 4096,
>      };
> 
>      assert(io_uring_register_buffers(&ring, &iov, 1) == 0);
> 
>      struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
>      assert(sqe != NULL);
> 
>      int fd = open("/dev/ng0n1", O_RDONLY);
>      assert(fd > 0);
>      sqe->fd = fd;
>      sqe->opcode = IORING_OP_URING_CMD;
>      sqe->cmd_op = NVME_URING_CMD_IO;
>      sqe->buf_index = 0;
>      sqe->flags = 0;
>      sqe->uring_cmd_flags = IORING_URING_CMD_FIXED;
> 
>      struct nvme_passthru_cmd *cmd = &sqe->cmd;
>      cmd->opcode = 2; // read
>      cmd->nsid = 1;
>      cmd->data_len = 1 * 4096;
>      cmd->addr = buf;
> 
>      struct io_uring_cqe *cqe;
>      assert(io_uring_submit(&ring) == 1);
>      assert(io_uring_wait_cqe(&ring, &cqe) == 0);
> 
>      printf("res %d\n", cqe->res);
> 
>      return 0;
> }
> # gcc -o test -luring test.c
> test.c: In function ‘main’:
> test.c:15:17: warning: implicit declaration of function ‘memalign’ [-Wimplicit-function-declaration]
>     15 |     void *buf = memalign(4096, 2 * 4096);
>        |                 ^~~~~~~~
> test.c:15:17: warning: initialization of ‘void *’ from ‘int’ makes pointer from integer without a cast [-Wint-conversion]
> test.c:36:37: warning: initialization of ‘struct nvme_passthru_cmd *’ from incompatible pointer type ‘__u8 (*)[0]’ {aka ‘unsigned char (*)[]’} [-Wincompatible-pointer-types]
>     36 |     struct nvme_passthru_cmd *cmd = &sqe->cmd;
>        |                                     ^
> test.c:40:15: warning: assignment to ‘__u64’ {aka ‘long long unsigned int’} from ‘void *’ makes integer from pointer without a cast [-Wint-conversion]
>     40 |     cmd->addr = buf;
>        |
> # ./test
> buf 0x406000
> res -22

Please convert the above code into a blktests pull request. See also
https://github.com/osandov/blktests.

Thanks,

Bart.
Jens Axboe Oct. 23, 2024, 11:03 p.m. UTC | #5
On Wed, 23 Oct 2024 15:15:19 -0600, Uday Shankar wrote:
> blk_rq_map_user_bvec contains a check bytes + bv->bv_len > nr_iter which
> causes unnecessary failures in NVMe passthrough I/O, reproducible as
> follows:
> 
> - register a 2 page, page-aligned buffer against a ring
> - use that buffer to do a 1 page io_uring NVMe passthrough read
> 
> [...]

Applied, thanks!

[1/1] block: fix sanity checks in blk_rq_map_user_bvec
      commit: 2ff949441802a8d076d9013c7761f63e8ae5a9bd

Best regards,
Chaitanya Kulkarni Oct. 24, 2024, 12:42 a.m. UTC | #6
> Please convert the above code into a blktests pull request. See also
> https://github.com/osandov/blktests.
>
> Thanks,
>
> Bart.
>

+1 it will be really useful get this tested on regularly ...

-ck
diff mbox series

Patch

diff --git a/block/blk-map.c b/block/blk-map.c
index 0e1167b23934..6ef2ec1f7d78 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -600,9 +600,7 @@  static int blk_rq_map_user_bvec(struct request *rq, const struct iov_iter *iter)
 		if (nsegs >= nr_segs || bytes > UINT_MAX - bv->bv_len)
 			goto put_bio;
 		if (bytes + bv->bv_len > nr_iter)
-			goto put_bio;
-		if (bv->bv_offset + bv->bv_len > PAGE_SIZE)
-			goto put_bio;
+			break;
 
 		nsegs++;
 		bytes += bv->bv_len;