diff mbox

block: split large discard requests from block frontend

Message ID 1459513321-3776-1-git-send-email-olaf@aepfle.de (mailing list archive)
State New, archived
Headers show

Commit Message

Olaf Hering April 1, 2016, 12:22 p.m. UTC
Large discard requests lead to sign expansion errors in qemu.
Since there is no API to tell a guest about the limitations qmeu
has to split a large request itself.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
---
 block/io.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

Comments

Max Reitz April 1, 2016, 5:19 p.m. UTC | #1
On 01.04.2016 14:22, Olaf Hering wrote:
> Large discard requests lead to sign expansion errors in qemu.
> Since there is no API to tell a guest about the limitations qmeu
> has to split a large request itself.
> 
> Signed-off-by: Olaf Hering <olaf@aepfle.de>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Cc: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/io.c | 22 +++++++++++++++++++++-
>  1 file changed, 21 insertions(+), 1 deletion(-)

Hi!

Let me first nag about some style problems, and then I'll come to the
technical/design aspect. :-)

> diff --git a/block/io.c b/block/io.c
> index c4869b9..5b6ed58 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2442,7 +2442,7 @@ static void coroutine_fn bdrv_discard_co_entry(void *opaque)
>      rwco->ret = bdrv_co_discard(rwco->bs, rwco->sector_num, rwco->nb_sectors);
>  }
>  
> -int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
> +static int __bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,

Two things: First, I think we do not like identifiers to start with
underscores, especially not with two underscores or with underscore and
a capital letter, because in C those combinations are generally reserved
for compiler/language extensions or for the operating system (think of
__attribute__(), __asm__(), or _Bool).

Second, the coroutine_fn should stay. This is just an empty macro (I
think) that signifies that a certain function is run inside of a
coroutine. That fact will not change if you put a wrapper around it.

>                                   int nb_sectors)

This should be aligned to the opening paranthesis of the line above.

>  {
>      BdrvTrackedRequest req;
> @@ -2524,6 +2524,26 @@ out:
>      return ret;
>  }
>  
> +int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
> +                                 int nb_sectors)
> +{
> +    int num, ret;
> +    int limit = BDRV_REQUEST_MAX_SECTORS;
> +    int remaining = nb_sectors;
> +    int64_t sector_offset = sector_num;
> +
> +    do {
> +        num = remaining > limit ? limit : remaining;

num = MIN(limit, remaining); would work just as fine and maybe look a
bit better.

> +        ret = __bdrv_co_discard(bs, sector_offset, num);
> +        if (ret < 0)
> +            break;

In qemu, every block is enclosed by {} braces, even if it contains only
a single statement.

This is something that the checkpatch script (scripts/checkpatch.pl in
the qemu tree) can detect. It is advisible to run that scripts on
patches before they are sent to the mailing list.

> +        remaining -= num;
> +        sector_offset += num;
> +    } while (remaining > 0);
> +
> +    return ret;
> +}
> +
>  int bdrv_discard(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
>  {
>      Coroutine *co;
> 

So, now about the technical aspect:

Guest requests are not issued to BlockDriverStates directly, they pass
through a BlockBackend first. The check whether the request is too large
is done there (in blk_check_request() called by blk_co_discard() and
blk_aio_discard()).

Thus, if a discard request exceeds BDRV_REQUEST_MAX_SECTORS, it will be
denied at the BlockBackend level and not reach bdrv_co_discard() at all.
At least that is how it's supposed to be. If it isn't, that's a bug. ;-)

Finally, I'm not sure whether this is something that should be done in
the block layer. I personally feel like it is the device models'
responsibility to only submit requests that adhere to the limits
established by the block layer.

In any case, do you have a test case where a guest was able to submit a
request that led to the overflow error you described in the commit message?

Max
Olaf Hering April 1, 2016, 5:49 p.m. UTC | #2
On Fri, Apr 01, Max Reitz wrote:

> In any case, do you have a test case where a guest was able to submit a
> request that led to the overflow error you described in the commit message?

mkfs -t ext4 /dev/sdb1 in a xen guest with qcow2 as backing device.
When I added discard support to libxl I worked with raw images, so I did
not notice this. Not sure why it happens to work in kvm guests. I assume
the frontend driver just works around the qemu bug by limiting its
request size.

Olaf
Max Reitz May 6, 2016, 4:44 p.m. UTC | #3
On 01.04.2016 19:49, Olaf Hering wrote:
> On Fri, Apr 01, Max Reitz wrote:
> 
>> In any case, do you have a test case where a guest was able to submit a
>> request that led to the overflow error you described in the commit message?
> 
> mkfs -t ext4 /dev/sdb1 in a xen guest with qcow2 as backing device.
> When I added discard support to libxl I worked with raw images, so I did
> not notice this. Not sure why it happens to work in kvm guests. I assume
> the frontend driver just works around the qemu bug by limiting its
> request size.

Sorry for not having replied in so long.

I know next to nothing about Xen, but I'm very much inclined to think
the Xen block driver (hw/block/xen_disk.c) is at fault here. The
blkif_request_discard structure it uses for accepting discard requests
apparently has a uint64_t nr_sectors field.

So I think using the Xen block device, it is possible for a guest to
submit discard requests with more then INT_MAX sectors involved, and
it's the Xen's block device emulation code job to split those requests
into smaller ones.

That said, I'm not sure this is ideal. It doesn't really look like other
block drivers care so much about splitting requests either. So we're a
bit in a pickle there.

Anyway, for your issue at hand I think there is a simpler solution.
bdrv_co_discard() can and already does split requests. So if we remove
the calls to blk_check_request() in blk_discard() and
bdrv_check_request() in bdrv_co_discard(), everything should already
work just fine.

Therefore, I think what could work for now is for blk_discard() and
bdrv_co_discard() to modify the checks they are doing on the incoming
request.

In theory, all we need to do is skip the length check for both, i.e.
accept requests even if they are longer than INT_MAX / BDRV_SECTOR_SIZE
sectors. I'm not sure how we can do that nicely in practice, though.

Max
Olaf Hering Nov. 17, 2016, 1:54 p.m. UTC | #4
On Fri, May 06, Max Reitz wrote:

> On 01.04.2016 19:49, Olaf Hering wrote:
> > On Fri, Apr 01, Max Reitz wrote:
> > 
> >> In any case, do you have a test case where a guest was able to submit a
> >> request that led to the overflow error you described in the commit message?
> > 
> > mkfs -t ext4 /dev/sdb1 in a xen guest with qcow2 as backing device.
> > When I added discard support to libxl I worked with raw images, so I did
> > not notice this. Not sure why it happens to work in kvm guests. I assume
> > the frontend driver just works around the qemu bug by limiting its
> > request size.
> 
> Sorry for not having replied in so long.
> 
> I know next to nothing about Xen, but I'm very much inclined to think
> the Xen block driver (hw/block/xen_disk.c) is at fault here. The
> blkif_request_discard structure it uses for accepting discard requests
> apparently has a uint64_t nr_sectors field.

Thanks for the pointer.

Looking at current master, BLKIF_OP_DISCARD is indeed broken. The values
passed from the guest are u64 and get stashed into signed values. I will
add a loop to repeatedly call blk_aio_pdiscard with small chunks of
BDRV_REQUEST_MAX_SECTORS.

We quickly checked other users of blk_aio_pdiscard and it appears they
are not affected because they notify the guest abuilt the limits.

Olaf
diff mbox

Patch

diff --git a/block/io.c b/block/io.c
index c4869b9..5b6ed58 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2442,7 +2442,7 @@  static void coroutine_fn bdrv_discard_co_entry(void *opaque)
     rwco->ret = bdrv_co_discard(rwco->bs, rwco->sector_num, rwco->nb_sectors);
 }
 
-int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
+static int __bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
                                  int nb_sectors)
 {
     BdrvTrackedRequest req;
@@ -2524,6 +2524,26 @@  out:
     return ret;
 }
 
+int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
+                                 int nb_sectors)
+{
+    int num, ret;
+    int limit = BDRV_REQUEST_MAX_SECTORS;
+    int remaining = nb_sectors;
+    int64_t sector_offset = sector_num;
+
+    do {
+        num = remaining > limit ? limit : remaining;
+        ret = __bdrv_co_discard(bs, sector_offset, num);
+        if (ret < 0)
+            break;
+        remaining -= num;
+        sector_offset += num;
+    } while (remaining > 0);
+
+    return ret;
+}
+
 int bdrv_discard(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
 {
     Coroutine *co;