
[1/1] nbd: increase maximum size of the PWRITE_ZERO request

Message ID 20180208132336.6519-1-edgar.kaziakhmedov@virtuozzo.com (mailing list archive)
State New, archived

Commit Message

Edgar Kaziakhmedov Feb. 8, 2018, 1:23 p.m. UTC
The upstream NBD protocol implementation supports an efficient zero-out
mechanism over the wire, along with the ability to check whether a
client allows using a hole.

Accordingly, since PWRITE_ZERO doesn't involve any payload on the wire,
increase the maximum size of a PWRITE_ZERO request to 1Gb (aligned).
Moreover, such a change will decrease the number of PWRITE_ZERO NBD
commands compared with the current 32M limit. The benefits of the
larger limit can be seen in block mirroring over NBD.

Signed-off-by: Edgar Kaziakhmedov <edgar.kaziakhmedov@virtuozzo.com>
---
 block/nbd.c         | 2 +-
 include/block/nbd.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

--
2.11.0
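
For context on why the 32M data-buffer cap need not apply here: an NBD
request is a fixed 28-byte header, and only NBD_CMD_WRITE follows it with
length bytes of payload. A rough sketch of the wire layout, per the NBD
protocol document (field names below are illustrative, not QEMU's actual
NBDRequest struct):

#include <stdint.h>

#define NBD_REQUEST_MAGIC    0x25609513
#define NBD_CMD_WRITE        1
#define NBD_CMD_WRITE_ZEROES 6

/* Fixed 28-byte request header; NBD_CMD_WRITE is followed by `length`
 * bytes of data, while NBD_CMD_WRITE_ZEROES sends nothing beyond this
 * header regardless of how large `length` is. */
struct nbd_request_wire {
    uint32_t magic;    /* NBD_REQUEST_MAGIC */
    uint16_t flags;    /* e.g. NBD_CMD_FLAG_NO_HOLE for WRITE_ZEROES */
    uint16_t type;     /* NBD_CMD_WRITE, NBD_CMD_WRITE_ZEROES, ... */
    uint64_t handle;   /* opaque cookie echoed back in the reply */
    uint64_t offset;   /* byte offset into the export */
    uint32_t length;   /* 32 bits, so at most 0xffffffff per request */
} __attribute__((packed));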

Comments

Eric Blake Feb. 8, 2018, 2:54 p.m. UTC | #1
On 02/08/2018 07:23 AM, Edgar Kaziakhmedov wrote:
> Upstream NBD protocol implementation supports an efficient zero out
> mechanism over the wire, along with the ability to check whether a
> client allows using a hole.
> 
> Accordingly, since PWRITE_ZERO doesn't involve any payload on the wire,
> increase a maximum size of the PWRITE_ZERO request up to 1Gb (aligned).
> Moreover, such change will decrease the number of PWRITE_ZERO NBD commands
> in comparison with the current 32M limit. The benefits of
> the larger constraint can be examined in a block mirroring over NBD.

We've got a potential problem.  Unless you have out-of-band 
communication of the maximum NBD_CMD_WRITE_ZEROES sizing (or if the NBD 
protocol is enhanced to advertise that as an additional piece of block 
size information during NBD_OPT_GO), then a client CANNOT assume that 
the server will accept a request this large.  We MIGHT get lucky if all 
existing servers that accept WRITE_ZEROES requests either act on large 
requests or reply with EINVAL but do not outright drop the connection 
(which is different from servers that DO outright drop the connection 
for an NBD_CMD_WRITE larger than 32M).  But I don't know if that's how 
all servers behave, so sending a too-large WRITE_ZEROES request may have 
the unintended consequence of killing the connection.

I'm adding the NBD list; perhaps before accepting this into qemu, I 
should revive my earlier attempt at codifying an NBD_OPT_GO info 
advertisement for maximum trim/zero sizing, which would let clients have 
a guarantee that their choice of sizing won't cause unexpected failures.
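
For reference, NBD_OPT_GO/NBD_OPT_INFO replies carry typed NBD_INFO_*
items; an advertisement along the lines described above might look
roughly like the sketch below. The info type value and layout are purely
hypothetical -- only NBD_INFO_EXPORT, NBD_INFO_NAME, NBD_INFO_DESCRIPTION
and NBD_INFO_BLOCK_SIZE exist in the spec at the time of this thread.

#include <stdint.h>

#define NBD_INFO_ZERO_SIZE 4   /* hypothetical, not an assigned value */

/* Hypothetical NBD_REP_INFO payload advertising trim/zero limits. */
struct nbd_info_zero_size {
    uint16_t type;             /* NBD_INFO_ZERO_SIZE */
    uint32_t max_trim_zero;    /* largest NBD_CMD_TRIM/WRITE_ZEROES length */
} __attribute__((packed));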

> 
> Signed-off-by: Edgar Kaziakhmedov <edgar.kaziakhmedov@virtuozzo.com>
> ---
>   block/nbd.c         | 2 +-
>   include/block/nbd.h | 3 +++
>   2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index 94220f6d14..3641d9244e 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -477,7 +477,7 @@ static void nbd_refresh_limits(BlockDriverState *bs, Error **errp)
>       uint32_t max = MIN_NON_ZERO(NBD_MAX_BUFFER_SIZE, s->info.max_block);
> 
>       bs->bl.max_pdiscard = max;
> -    bs->bl.max_pwrite_zeroes = max;
> +    bs->bl.max_pwrite_zeroes = NBD_MAX_PWRITE_ZERO_SIZE;

max_discard should get the same treatment, whatever we decide (yes, it 
is feasible that a server may have different limits for NBD_CMD_TRIM 
than for NBD_CMD_WRITE_ZEROES, but in general, it's probably easier if 
both commands share the same limit as both have very similar 
implementations).
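
Something along these lines inside nbd_refresh_limits(), reusing the
macro from the patch (untested sketch; the macro probably wants a more
generic name if it is meant to cover TRIM as well):

    uint32_t max = MIN_NON_ZERO(NBD_MAX_BUFFER_SIZE, s->info.max_block);

    bs->bl.max_pdiscard = NBD_MAX_PWRITE_ZERO_SIZE;
    bs->bl.max_pwrite_zeroes = NBD_MAX_PWRITE_ZERO_SIZE;
    bs->bl.max_transfer = max;   /* READ/WRITE keep the 32M data cap */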

>       bs->bl.max_transfer = max;
> 
>       if (s->info.opt_block &&
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index ee74ec391a..e2f18e2332 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h
> @@ -182,6 +182,9 @@ enum {
>   /* Maximum size of a single READ/WRITE data buffer */
>   #define NBD_MAX_BUFFER_SIZE (32 * 1024 * 1024)
> 
> +/* Maximum size of a single PWRITE_ZERO request 1Gb */
> +#define NBD_MAX_PWRITE_ZERO_SIZE (1024 * 1024 * 1024)
> +
>   /* Maximum size of an export name. The NBD spec requires 256 and
>    * suggests that servers support up to 4096, but we stick to only the
>    * required size so that we can stack-allocate the names, and because
> --
> 2.11.0
> 
>
Edgar Kaziakhmedov Feb. 8, 2018, 3:28 p.m. UTC | #2
On 02/08/2018 05:54 PM, Eric Blake wrote:
> On 02/08/2018 07:23 AM, Edgar Kaziakhmedov wrote:
>> Upstream NBD protocol implementation supports an efficient zero out
>> mechanism over the wire, along with the ability to check whether a
>> client allows using a hole.
>>
>> Accordingly, since PWRITE_ZERO doesn't involve any payload on the wire,
>> increase a maximum size of the PWRITE_ZERO request up to 1Gb (aligned).
>> Moreover, such change will decrease the number of PWRITE_ZERO NBD 
>> commands
>> in comparison with the current 32M limit. The benefits of
>> the larger constraint can be examined in a block mirroring over NBD.
>
> We've got a potential problem.  Unless you have out-of-band 
> communication of the maximum NBD_CMD_WRITE_ZEROES sizing (or if the 
> NBD protocol is enhanced to advertise that as an additional piece of 
> block size information during NBD_OPT_GO), then a client CANNOT assume 
> that the server will accept a request this large. We MIGHT get lucky 
> if all existing servers that accept WRITE_ZEROES requests either act 
> on large requests or reply with EINVAL but do not outright drop the 
> connection (which is different from servers that DO outright drop the 
> connection for an NBD_CMD_WRITE larger than 32M).  But I don't know if 
> that's how all servers behave, so sending a too-large WRITE_ZEROES 
> request may have the unintended consequence of killing the connection.
Actually, I do not understand why current NBD servers shouldn't accept 
such large requests, because most servers should apply some 
optimization that avoids filling with zeroes directly.
As for block mirroring over NBD, it works fine with the QEMU server 
implementation, and isn't that the main application?
>
> I'm adding the NBD list; perhaps before accepting this into qemu, I 
> should revive my earlier attempt at codifying an NBD_OPT_GO info 
> advertisement for maximum trim/zero sizing, which would let clients 
> have a guarantee that their choice of sizing won't cause unexpected 
> failures.
>
>>
>> Signed-off-by: Edgar Kaziakhmedov <edgar.kaziakhmedov@virtuozzo.com>
>> ---
>>   block/nbd.c         | 2 +-
>>   include/block/nbd.h | 3 +++
>>   2 files changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/nbd.c b/block/nbd.c
>> index 94220f6d14..3641d9244e 100644
>> --- a/block/nbd.c
>> +++ b/block/nbd.c
>> @@ -477,7 +477,7 @@ static void nbd_refresh_limits(BlockDriverState 
>> *bs, Error **errp)
>>       uint32_t max = MIN_NON_ZERO(NBD_MAX_BUFFER_SIZE, 
>> s->info.max_block);
>>
>>       bs->bl.max_pdiscard = max;
>> -    bs->bl.max_pwrite_zeroes = max;
>> +    bs->bl.max_pwrite_zeroes = NBD_MAX_PWRITE_ZERO_SIZE;
>
> max_discard should get the same treatment, whatever we decide (yes, it 
> is feasible that a server may have different limits for NBD_CMD_TRIM 
> than for NBD_CMD_WRITE_ZEROES, but in general, it's probably easier if 
> both commands share the same limit as both have very similar 
> implementations).
Sounds reasonable.
>
>>       bs->bl.max_transfer = max;
>>
>>       if (s->info.opt_block &&
>> diff --git a/include/block/nbd.h b/include/block/nbd.h
>> index ee74ec391a..e2f18e2332 100644
>> --- a/include/block/nbd.h
>> +++ b/include/block/nbd.h
>> @@ -182,6 +182,9 @@ enum {
>>   /* Maximum size of a single READ/WRITE data buffer */
>>   #define NBD_MAX_BUFFER_SIZE (32 * 1024 * 1024)
>>
>> +/* Maximum size of a single PWRITE_ZERO request 1Gb */
>> +#define NBD_MAX_PWRITE_ZERO_SIZE (1024 * 1024 * 1024)
>> +
>>   /* Maximum size of an export name. The NBD spec requires 256 and
>>    * suggests that servers support up to 4096, but we stick to only the
>>    * required size so that we can stack-allocate the names, and because
>> -- 
>> 2.11.0
>>
>>
>
Eric Blake Feb. 8, 2018, 3:28 p.m. UTC | #3
[resending with correct NBD list address]

On 02/08/2018 07:23 AM, Edgar Kaziakhmedov wrote:
> Upstream NBD protocol implementation supports an efficient zero out
> mechanism over the wire, along with the ability to check whether a
> client allows using a hole.
> 
> Accordingly, since PWRITE_ZERO doesn't involve any payload on the wire,
> increase a maximum size of the PWRITE_ZERO request up to 1Gb (aligned).
> Moreover, such change will decrease the number of PWRITE_ZERO NBD commands
> in comparison with the current 32M limit. The benefits of
> the larger constraint can be examined in a block mirroring over NBD.

We've got a potential problem.  Unless you have out-of-band 
communication of the maximum NBD_CMD_WRITE_ZEROES sizing (or if the NBD 
protocol is enhanced to advertise that as an additional piece of block 
size information during NBD_OPT_GO), then a client CANNOT assume that 
the server will accept a request this large.  We MIGHT get lucky if all 
existing servers that accept WRITE_ZEROES requests either act on large 
requests or reply with EINVAL but do not outright drop the connection 
(which is different from servers that DO outright drop the connection 
for an NBD_CMD_WRITE larger than 32M).  But I don't know if that's how 
all servers behave, so sending a too-large WRITE_ZEROES request may have 
the unintended consequence of killing the connection.

I'm adding the NBD list; perhaps before accepting this into qemu, I 
should revive my earlier attempt at codifying an NBD_OPT_GO info 
advertisement for maximum trim/zero sizing, which would let clients have 
a guarantee that their choice of sizing won't cause unexpected failures.

> 
> Signed-off-by: Edgar Kaziakhmedov <edgar.kaziakhmedov@virtuozzo.com>
> ---
>   block/nbd.c         | 2 +-
>   include/block/nbd.h | 3 +++
>   2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index 94220f6d14..3641d9244e 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -477,7 +477,7 @@ static void nbd_refresh_limits(BlockDriverState *bs, Error **errp)
>       uint32_t max = MIN_NON_ZERO(NBD_MAX_BUFFER_SIZE, s->info.max_block);
> 
>       bs->bl.max_pdiscard = max;
> -    bs->bl.max_pwrite_zeroes = max;
> +    bs->bl.max_pwrite_zeroes = NBD_MAX_PWRITE_ZERO_SIZE;

max_discard should get the same treatment, whatever we decide (yes, it 
is feasible that a server may have different limits for NBD_CMD_TRIM 
than for NBD_CMD_WRITE_ZEROES, but in general, it's probably easier if 
both commands share the same limit as both have very similar 
implementations).

>       bs->bl.max_transfer = max;
> 
>       if (s->info.opt_block &&
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index ee74ec391a..e2f18e2332 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h
> @@ -182,6 +182,9 @@ enum {
>   /* Maximum size of a single READ/WRITE data buffer */
>   #define NBD_MAX_BUFFER_SIZE (32 * 1024 * 1024)
> 
> +/* Maximum size of a single PWRITE_ZERO request 1Gb */
> +#define NBD_MAX_PWRITE_ZERO_SIZE (1024 * 1024 * 1024)
> +
>   /* Maximum size of an export name. The NBD spec requires 256 and
>    * suggests that servers support up to 4096, but we stick to only the
>    * required size so that we can stack-allocate the names, and because
> --
> 2.11.0
> 
>
Eric Blake Feb. 8, 2018, 3:55 p.m. UTC | #4
On 02/08/2018 09:28 AM, Edgar Kaziakhmedov wrote:

>> We've got a potential problem.  Unless you have out-of-band 
>> communication of the maximum NBD_CMD_WRITE_ZEROES sizing (or if the 
>> NBD protocol is enhanced to advertise that as an additional piece of 
>> block size information during NBD_OPT_GO), then a client CANNOT assume 
>> that the server will accept a request this large. We MIGHT get lucky 
>> if all existing servers that accept WRITE_ZEROES requests either act 
>> on large requests or reply with EINVAL but do not outright drop the 
>> connection (which is different from servers that DO outright drop the 
>> connection for an NBD_CMD_WRITE larger than 32M).  But I don't know if 
>> that's how all servers behave, so sending a too-large WRITE_ZEROES 
>> request may have the unintended consequence of killing the connection.
> Actually, I do not understand why current NBD servers shouldn't accept 
> such large requests, because most servers should apply some 
> optimizations avoiding direct filling with zeroes.

Just because a server CAN optimize doesn't mean that it is REQUIRED to 
optimize.  You cannot make assumptions that a server will be happy with 
a larger request, merely because less data was sent over the wire, 
because the server may still have to allocate memory locally to perform 
the request.
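
To illustrate the point (a deliberately naive, hypothetical handler, not
code from any real NBD server): even with no payload on the wire, a
server written like this still has to materialize the zeroes locally, so
a 1G WRITE_ZEROES can fail or be rejected on the server side.

#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

static int naive_write_zeroes(int export_fd, uint64_t offset, uint32_t len)
{
    void *zeroes = calloc(1, len);   /* a 1G request means a 1G allocation */
    ssize_t ret;

    if (!zeroes) {
        return -ENOMEM;              /* or the server may just drop us */
    }
    ret = pwrite(export_fd, zeroes, len, offset);  /* short writes ignored */
    free(zeroes);
    return ret < 0 ? -errno : 0;
}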

> As for block-mirroring over NBD, it works fine with QEMU server 
> implementation and isn't it the main application?

Yes, qemu-to-qemu interoperating as efficiently as possible is nice; but 
I'm worried about qemu-to-other interoperating as well.  The point of a 
public specification is to avoid one-way silos, so that you CAN 
mix-and-match a server from one implementation with a client from 
another, rather than being forced to use qemu as the server when qemu is 
the client.  Note that portability can include hand-shaking to fall back 
to the least-common denominator, rather than requiring both sides to 
always understand all extensions; but the important part is that neither 
party should make assumptions about the other side without using the 
spec as their guide.
Edgar Kaziakhmedov Feb. 8, 2018, 4:01 p.m. UTC | #5
On 02/08/2018 06:55 PM, Eric Blake wrote:
> On 02/08/2018 09:28 AM, Edgar Kaziakhmedov wrote:
>
>>> We've got a potential problem.  Unless you have out-of-band 
>>> communication of the maximum NBD_CMD_WRITE_ZEROES sizing (or if the 
>>> NBD protocol is enhanced to advertise that as an additional piece of 
>>> block size information during NBD_OPT_GO), then a client CANNOT 
>>> assume that the server will accept a request this large. We MIGHT 
>>> get lucky if all existing servers that accept WRITE_ZEROES requests 
>>> either act on large requests or reply with EINVAL but do not 
>>> outright drop the connection (which is different from servers that 
>>> DO outright drop the connection for an NBD_CMD_WRITE larger than 
>>> 32M).  But I don't know if that's how all servers behave, so sending 
>>> a too-large WRITE_ZEROES request may have the unintended consequence 
>>> of killing the connection.
>> Actually, I do not understand why current NBD servers shouldn't 
>> accept such large requests, because most servers should apply some 
>> optimizations avoiding direct filling with zeroes.
>
> Just because a server CAN optimize doesn't mean that it is REQUIRED to 
> optimize.  You cannot make assumptions that a server will be happy 
> with a larger request, merely because less data was sent over the 
> wire, because the server may still have to allocate memory locally to 
> perform the request.
>
>> As for block-mirroring over NBD, it works fine with QEMU server 
>> implementation and isn't it the main application?
>
> Yes, qemu-to-qemu interoperating as efficiently as possible is nice; 
> but I'm worried about qemu-to-other interoperating as well. The point 
> of a public specification is to avoid one-way silos, so that you CAN 
> mix-and-match a server from one implementation with a client from 
> another, rather than being forced to use qemu as the server when qemu 
> is the client.  Note that portability can include hand-shaking to fall 
> back to the least-common denominator, rather than requiring both sides 
> to always understand all extensions; but the important part is that 
> neither party should make assumptions about the other side without 
> using the spec as their guide.
So, in that case the biggest write_zero chunk size needs to be 
negotiated before communication starts, if such an option is featured 
in mainline NBD.
Alex Bligh Feb. 10, 2018, 5:43 p.m. UTC | #6
> On 8 Feb 2018, at 16:28, Eric Blake <eblake@redhat.com> wrote:
> 
> On 02/08/2018 07:23 AM, Edgar Kaziakhmedov wrote:
>> Upstream NBD protocol implementation supports an efficient zero out
>> mechanism over the wire, along with the ability to check whether a
>> client allows using a hole.
>> Accordingly, since PWRITE_ZERO doesn't involve any payload on the wire,
>> increase a maximum size of the PWRITE_ZERO request up to 1Gb (aligned).
>> Moreover, such change will decrease the number of PWRITE_ZERO NBD commands
>> in comparison with the current 32M limit. The benefits of
>> the larger constraint can be examined in a block mirroring over NBD.
> 
> We've got a potential problem.  Unless you have out-of-band communication of the maximum NBD_CMD_WRITE_ZEROES sizing (or if the NBD protocol is enhanced to advertise that as an additional piece of block size information during NBD_OPT_GO), then a client CANNOT assume that the server will accept a request this large.  We MIGHT get lucky if all existing servers that accept WRITE_ZEROES requests either act on large requests or reply with EINVAL but do not outright drop the connection (which is different from servers that DO outright drop the connection for an NBD_CMD_WRITE larger than 32M).  But I don't know if that's how all servers behave, so sending a too-large WRITE_ZEROES request may have the unintended consequence of killing the connection.
> 
> I'm adding the NBD list; perhaps before accepting this into qemu, I should revive my earlier attempt at codifying an NBD_OPT_GO info advertisement for maximum trim/zero sizing, which would let clients have a guarantee that their choice of sizing won't cause unexpected failures.

A couple of comments:

1. The length field is only 32 bits, so no writes more than 0xffffffff in length are going to work anyway :-)

2. I'm not sure the situation is as bad as you make out, Eric. I think you've forgotten the NBD_OPT_INFO work and the conversation we had around that, where we determined that servers not supporting NBD_OPT_INFO were already meant to support 'unlimited' size writes. Per the spec:

"If block size constraints have not been advertised or agreed on externally, then a client SHOULD assume a default minimum block size of 1, a preferred block size of 2^12 (4,096), and a maximum block size of the smaller of the export size or 0xffffffff (effectively unlimited)."

I read these to apply to all uses of 'length', but even if one argues it doesn't apply to NBD_CMD_WRITE_ZEROES because it doesn't have a payload, I think the rebuttal is that a server which supports NBD_CMD_WRITE of a given length must also support NBD_CMD_WRITE_ZEROES of that length.

So I think a reasonable approach for Qemu would be to try NBD_OPT_INFO and find the maximum write size, and if that's unsupported use 0xffffffff (capping at export size, or export size minus write offset).
Alex Bligh Feb. 10, 2018, 5:45 p.m. UTC | #7
> On 10 Feb 2018, at 18:43, Alex Bligh <alex@alex.org.uk> wrote:
> 
> So I think a reasonable approach for Qemu would be to try NBD_OPT_INFO and find the maximum write size, and if that's unsupported use 0xffffffff (capping at export size, or export size minus write offset).

Er, actually capping it at (2^32 - blocksize) would be the right thing to do (writes should be multiples of the block size).
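
Putting the two mails together, the client-side clamping being suggested
might look roughly like this (helper and variable names are made up for
illustration, not existing QEMU code):

#include <stdint.h>

/* advertised_max is what NBD_OPT_INFO/NBD_OPT_GO reported, or 0 if the
 * server advertised nothing; block_size is the minimum block size. */
static uint32_t max_zero_request(uint64_t export_size, uint64_t offset,
                                 uint32_t advertised_max, uint32_t block_size)
{
    uint64_t limit = advertised_max ? advertised_max : UINT32_MAX;

    if (limit > export_size - offset) {
        limit = export_size - offset;   /* don't run past the export */
    }
    limit -= limit % block_size;        /* keep it block-aligned, i.e. at
                                           most 2^32 - block_size */
    return (uint32_t)limit;
}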
Eric Blake May 1, 2018, 9:16 p.m. UTC | #8
On 02/08/2018 07:23 AM, Edgar Kaziakhmedov wrote:
> Upstream NBD protocol implementation supports an efficient zero out
> mechanism over the wire, along with the ability to check whether a
> client allows using a hole.
> 
> Accordingly, since PWRITE_ZERO doesn't involve any payload on the wire,
> increase a maximum size of the PWRITE_ZERO request up to 1Gb (aligned).
> Moreover, such change will decrease the number of PWRITE_ZERO NBD commands
> in comparison with the current 32M limit. The benefits of
> the larger constraint can be examined in a block mirroring over NBD.
> 
> Signed-off-by: Edgar Kaziakhmedov <edgar.kaziakhmedov@virtuozzo.com>
> ---
>   block/nbd.c         | 2 +-
>   include/block/nbd.h | 3 +++
>   2 files changed, 4 insertions(+), 1 deletion(-)

I've posted a counter-proposal series that should accomplish the same 
effect, but in a way (soon to be) sanctioned by the NBD spec:

https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg00135.html

Patch

diff --git a/block/nbd.c b/block/nbd.c
index 94220f6d14..3641d9244e 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -477,7 +477,7 @@  static void nbd_refresh_limits(BlockDriverState *bs, Error **errp)
     uint32_t max = MIN_NON_ZERO(NBD_MAX_BUFFER_SIZE, s->info.max_block);

     bs->bl.max_pdiscard = max;
-    bs->bl.max_pwrite_zeroes = max;
+    bs->bl.max_pwrite_zeroes = NBD_MAX_PWRITE_ZERO_SIZE;
     bs->bl.max_transfer = max;

     if (s->info.opt_block &&
diff --git a/include/block/nbd.h b/include/block/nbd.h
index ee74ec391a..e2f18e2332 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -182,6 +182,9 @@  enum {
 /* Maximum size of a single READ/WRITE data buffer */
 #define NBD_MAX_BUFFER_SIZE (32 * 1024 * 1024)

+/* Maximum size of a single PWRITE_ZERO request 1Gb */
+#define NBD_MAX_PWRITE_ZERO_SIZE (1024 * 1024 * 1024)
+
 /* Maximum size of an export name. The NBD spec requires 256 and
  * suggests that servers support up to 4096, but we stick to only the
  * required size so that we can stack-allocate the names, and because