diff mbox

[v4,7/8] nbd: Implement NBD_INFO_BLOCK_SIZE on server

Message ID 20170221024248.11027-8-eblake@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Eric Blake Feb. 21, 2017, 2:42 a.m. UTC
The upstream NBD Protocol has defined a new extension to allow
the server to advertise block sizes to the client, as well as
a way for the client to inform the server that it intends to
obey block sizes.

Thanks to a recent fix (commit df7b97ff), our real minimum
transfer size is always 1 (the block layer takes care of
read-modify-write on our behalf), but we're still more efficient
if we advertise 512 when the client supports it, as follows:
- OPT_INFO, but no NBD_INFO_BLOCK_SIZE: advertise 512, then
fail with NBD_REP_ERR_BLOCK_SIZE_REQD; client is free to try
something else since we don't disconnect
- OPT_INFO with NBD_INFO_BLOCK_SIZE: advertise 512
- OPT_GO, but no NBD_INFO_BLOCK_SIZE: advertise 1
- OPT_GO with NBD_INFO_BLOCK_SIZE: advertise 512

We can also advertise the optimum block size (presumably the
cluster size, when exporting a qcow2 file), and our absolute
maximum transfer size of 32M, to help newer clients avoid
EINVAL failures or abrupt disconnects on oversize requests.

We do not reject clients for using the older NBD_OPT_EXPORT_NAME;
we are no worse off for those clients than we used to be.

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v4: new patch
---
 nbd/server.c | 47 +++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 43 insertions(+), 4 deletions(-)

Comments

Paolo Bonzini Feb. 22, 2017, 4:51 p.m. UTC | #1
On 21/02/2017 03:42, Eric Blake wrote:
> +    /* Send NBD_INFO_BLOCK_SIZE always, but tweak the minimum size
> +     * according to whether the client requested it, and according to
> +     * whether this is OPT_INFO or OPT_GO. */
> +    /* minimum - 1 for back-compat, or 512 if client is new enough.
> +     * TODO: consult blk_bs(blk)->request_align? */
> +    sizes[0] = (opt == NBD_OPT_INFO || blocksize) ? BDRV_SECTOR_SIZE : 1;
> +    /* preferred - At least 4096, but larger as appropriate. */
> +    sizes[1] = MAX(blk_get_opt_transfer(exp->blk), 4096);

Can we just say zero if the preferred transfer size is unknown?

Apart from this, it looks good.

Paolo

> +    /* maximum - At most 32M, but smaller as appropriate. */
> +    sizes[2] = MIN(blk_get_max_transfer(exp->blk), NBD_MAX_BUFFER_SIZE);
> +    TRACE("advertising minimum 0x%" PRIx32 ", preferred 0x%" PRIx32
> +          ", maximum 0x%" PRIx32, sizes[0], sizes[1], sizes[2]);
Eric Blake Feb. 22, 2017, 5:11 p.m. UTC | #2
On 02/22/2017 10:51 AM, Paolo Bonzini wrote:
> 
> 
> On 21/02/2017 03:42, Eric Blake wrote:
>> +    /* Send NBD_INFO_BLOCK_SIZE always, but tweak the minimum size
>> +     * according to whether the client requested it, and according to
>> +     * whether this is OPT_INFO or OPT_GO. */
>> +    /* minimum - 1 for back-compat, or 512 if client is new enough.
>> +     * TODO: consult blk_bs(blk)->request_align? */
>> +    sizes[0] = (opt == NBD_OPT_INFO || blocksize) ? BDRV_SECTOR_SIZE : 1;
>> +    /* preferred - At least 4096, but larger as appropriate. */
>> +    sizes[1] = MAX(blk_get_opt_transfer(exp->blk), 4096);
> 
> Can we just say zero if the preferred transfer size is unknown?

The NBD specification requires a non-zero power-of-2 number if the
server transmits the block size at all; 1 is the ideal number, followed
by whatever actual size we learn from the request_align of the device.


I added blk_get_opt_transfer() in 3/8; maybe I should just respin the
series to also add blk_get_request_align() as well and get rid of the
TODO here.  In other words, my TODO is probably the result of me not
remembering to complete my rebase (since v3 of the series was posted so
many months ago).

> 
> Apart from this, it looks good.

Thanks; I'll get a respin out soon, with R-b added where they make
sense, and with the TODO removed.
Paolo Bonzini Feb. 22, 2017, 5:26 p.m. UTC | #3
On 22/02/2017 18:11, Eric Blake wrote:
>>> +    /* preferred - At least 4096, but larger as appropriate. */
>>> +    sizes[1] = MAX(blk_get_opt_transfer(exp->blk), 4096);
>
> The NBD specification requires a non-zero power-of-2 number if the
> server transmits the block size at all; 1 is the ideal number, followed
> by whatever actual size we learn from the request_align of the device.

Oh, so it's the smallest "good" transfer size, or the preferred
alignment.  That's not the same as the SCSI definition, which is:

   If a device server receives one of these commands with a transfer
   size greater than this value, then the device server may incur
   delays in processing the command. An OPTIMAL TRANSFER LENGTH field
   set to 0000_0000h indicates that the device server does not report
   an optimal transfer size.

It's more similar to the physical block size:

   When using logical block access commands (see 4.2.2), application
   clients should:
   a) specify an LBA that is aligned to a physical block boundary; and
   b) access an integral number of physical blocks, provided that the
   access does not go beyond the last LBA on the medium.

So I'd rather ignore it in the client, and send 4096 in the server.

Paolo
Eric Blake Feb. 22, 2017, 8:34 p.m. UTC | #4
On 02/22/2017 11:26 AM, Paolo Bonzini wrote:
> 
> 
> On 22/02/2017 18:11, Eric Blake wrote:
>>>> +    /* preferred - At least 4096, but larger as appropriate. */
>>>> +    sizes[1] = MAX(blk_get_opt_transfer(exp->blk), 4096);
>>
>> The NBD specification requires a non-zero power-of-2 number if the
>> server transmits the block size at all; 1 is the ideal number, followed
>> by whatever actual size we learn from the request_align of the device.

Oh shoot - now I notice I misread your complaint - I thought you were
complaining about sizes[0] (min_size) having a TODO comment; but you
were talking about sizes[1] (preferred_size).

The NBD spec wording can be changed if needed (after all, it is still
experimental), but it currently says:

"If block size constraints have not been advertised or agreed on
externally, then a client SHOULD assume a default minimum block size of
1, a preferred block size of 2^12 (4,096), and a maximum block size of
the smaller of the export size or 0xffffffff (effectively unlimited).

...

"The preferred block size represents the minimum size at which aligned
requests will have efficient I/O, avoiding behaviour such as
read-modify-write. If advertised, this MUST be a power of 2 at least as
large as the smaller of the minimum block size and 2^12 (4,096),
although larger values (such as the minimum granularity of a hole) are
also appropriate. The preferred block size MAY be larger than the export
size, in which case the client is unable to utilize the preferred block
size for that export. The server MAY advertise an export size that is
not an integer multiple of the preferred block size."

> 
> Oh, so it's the smallest "good" transfer size, or the preferred
> alignment.  That's not the same as the SCSI definition, which is:
> 
>    If a device server receives one of these commands with a transfer
>    size greater than this value, then the device server may incur
>    delays in processing the command. An OPTIMAL TRANSFER LENGTH field
>    set to 0000_0000h indicates that the device server does not report
>    an optimal transfer size.

Hmm - that's yet another limit. I don't know if our block layer exposes
it, or if it should expose it.

> 
> It's more similar to the physical block size:

Indeed; at least that was my intent (picking a size that will avoid
read-modify-write pessimations, as well as reflecting granularity of
trim/zero operations).

> 
>    When using logical block access commands (see 4.2.2), application
>    clients should:
>    a) specify an LBA that is aligned to a physical block boundary; and
>    b) access an integral number of physical blocks, provided that the
>    access does not go beyond the last LBA on the medium.
> 
> So I'd rather ignore it in the client, and send 4096 in the server.

Does that mean our BlockLimits structure documentation needs a tweak,
too?  It currently reads:

    /* Optimal transfer length in bytes.  A power of 2 is best but not
     * mandatory.  Must be a multiple of bl.request_alignment, or 0 if
     * no preferred size */
    uint32_t opt_transfer;

Are we trying to track both optimum size in the SCSI sense _and_ block
size in the O_DIRECT sense?
Paolo Bonzini Feb. 23, 2017, 1:13 p.m. UTC | #5
On 22/02/2017 21:34, Eric Blake wrote:
> 
>> Oh, so it's the smallest "good" transfer size, or the preferred
>> alignment.  That's not the same as the SCSI definition, which is:
>>
>>    If a device server receives one of these commands with a transfer
>>    size greater than this value, then the device server may incur
>>    delays in processing the command. An OPTIMAL TRANSFER LENGTH field
>>    set to 0000_0000h indicates that the device server does not report
>>    an optimal transfer size.
> Hmm - that's yet another limit. I don't know if our block layer exposes
> it, or if it should expose it.

It exposes it in opt_transfer. :)  The only driver that sets
opt_transfer is the iSCSI driver and uses the SCSI meaning.

>> It's more similar to the physical block size:
>
> Indeed; at least that was my intent (picking a size that will avoid
> read-modify-write pessimations, as well as reflecting granularity of
> trim/zero operations).

The two are not necessarily related.  Granularity of trim/zero is
usually a multiple of the sector size (for example qcow2 will have
physical block size = host physical block size, and unmap granularity =
cluster size).

>>    When using logical block access commands (see 4.2.2), application
>>    clients should:
>>    a) specify an LBA that is aligned to a physical block boundary; and
>>    b) access an integral number of physical blocks, provided that the
>>    access does not go beyond the last LBA on the medium.
>>
>> So I'd rather ignore it in the client, and send 4096 in the server.
> Does that mean our BlockLimits structure documentation needs a tweak,
> too?  It currently reads:
> 
>     /* Optimal transfer length in bytes.  A power of 2 is best but not
>      * mandatory.  Must be a multiple of bl.request_alignment, or 0 if
>      * no preferred size */
>     uint32_t opt_transfer;
> 
> Are we trying to track both optimum size in the SCSI sense _and_ block
> size in the O_DIRECT sense?

As of now, just in the SCSI sense.

Paolo
Eric Blake Feb. 23, 2017, 3:38 p.m. UTC | #6
On 02/23/2017 07:13 AM, Paolo Bonzini wrote:
> 
> 
> On 22/02/2017 21:34, Eric Blake wrote:
>>
>>> Oh, so it's the smallest "good" transfer size, or the preferred
>>> alignment.  That's not the same as the SCSI definition, which is:
>>>
>>>    If a device server receives one of these commands with a transfer
>>>    size greater than this value, then the device server may incur
>>>    delays in processing the command. An OPTIMAL TRANSFER LENGTH field
>>>    set to 0000_0000h indicates that the device server does not report
>>>    an optimal transfer size.
>> Hmm - that's yet another limit. I don't know if our block layer exposes
>> it, or if it should expose it.
> 
> It exposes it in opt_transfer. :)  The only driver that sets
> opt_transfer is the iSCSI driver and uses the SCSI meaning.

>> Does that mean our BlockLimits structure documentation needs a tweak,
>> too?  It currently reads:
>>
>>     /* Optimal transfer length in bytes.  A power of 2 is best but not
>>      * mandatory.  Must be a multiple of bl.request_alignment, or 0 if
>>      * no preferred size */
>>     uint32_t opt_transfer;
>>
>> Are we trying to track both optimum size in the SCSI sense _and_ block
>> size in the O_DIRECT sense?
> 
> As of now, just in the SCSI sense.

Okay, sounds like I need a pre-requisite patch to beef up the
BlockLimits documentation, and maybe adding in a preferred size for yet
another parameter that we learn from stat() on files (there really are
performance benefits for performing I/O on a file in aligned requests,
even if it can do byte-wise manipulations).  And maybe the NBD
documentation needs a tweak too - which values are the most beneficial
to expose over the wire?  Should NBD be exposing more than just
min/preferred/max?
diff mbox

Patch

diff --git a/nbd/server.c b/nbd/server.c
index 3b1a4a5..d63e5d3 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -409,6 +409,8 @@  static int nbd_negotiate_handle_info(NBDClient *client, uint32_t length,
     uint16_t request;
     uint32_t namelen;
     bool sendname = false;
+    bool blocksize = false;
+    uint32_t sizes[3];
     char buf[sizeof(uint64_t) + sizeof(uint16_t)];
     const char *msg;

@@ -444,11 +446,16 @@  static int nbd_negotiate_handle_info(NBDClient *client, uint32_t length,
         length -= sizeof(request);
         TRACE("Client requested info %d (%s)", request,
               nbd_info_lookup(request));
-        /* For now, we only care about NBD_INFO_NAME; everything else
-         * is either a request we don't know or something we send
-         * regardless of request. */
-        if (request == NBD_INFO_NAME) {
+        /* We care about NBD_INFO_NAME and NBD_INFO_BLOCK_SIZE;
+         * everything else is either a request we don't know or
+         * something we send regardless of request */
+        switch (request) {
+        case NBD_INFO_NAME:
             sendname = true;
+            break;
+        case NBD_INFO_BLOCK_SIZE:
+            blocksize = true;
+            break;
         }
     }

@@ -499,6 +506,27 @@  static int nbd_negotiate_handle_info(NBDClient *client, uint32_t length,
         }
     }

+    /* Send NBD_INFO_BLOCK_SIZE always, but tweak the minimum size
+     * according to whether the client requested it, and according to
+     * whether this is OPT_INFO or OPT_GO. */
+    /* minimum - 1 for back-compat, or 512 if client is new enough.
+     * TODO: consult blk_bs(blk)->request_align? */
+    sizes[0] = (opt == NBD_OPT_INFO || blocksize) ? BDRV_SECTOR_SIZE : 1;
+    /* preferred - At least 4096, but larger as appropriate. */
+    sizes[1] = MAX(blk_get_opt_transfer(exp->blk), 4096);
+    /* maximum - At most 32M, but smaller as appropriate. */
+    sizes[2] = MIN(blk_get_max_transfer(exp->blk), NBD_MAX_BUFFER_SIZE);
+    TRACE("advertising minimum 0x%" PRIx32 ", preferred 0x%" PRIx32
+          ", maximum 0x%" PRIx32, sizes[0], sizes[1], sizes[2]);
+    cpu_to_be32s(&sizes[0]);
+    cpu_to_be32s(&sizes[1]);
+    cpu_to_be32s(&sizes[2]);
+    rc = nbd_negotiate_send_info(client, opt, NBD_INFO_BLOCK_SIZE,
+                                 sizeof(sizes), sizes);
+    if (rc < 0) {
+        return rc;
+    }
+
     /* Send NBD_INFO_EXPORT always */
     TRACE("advertising size %" PRIu64 " and flags %" PRIx16,
           exp->size, exp->nbdflags | myflags);
@@ -510,6 +538,17 @@  static int nbd_negotiate_handle_info(NBDClient *client, uint32_t length,
         return rc;
     }

+    /* If the client is just asking for NBD_OPT_INFO, but forgot to
+     * request block sizes, return an error.
+     * TODO: consult blk_bs(blk)->request_align, and only error if it
+     * is not 1? */
+    if (opt == NBD_OPT_INFO && !blocksize) {
+        return nbd_negotiate_send_rep_err(client->ioc,
+                                          NBD_REP_ERR_BLOCK_SIZE_REQD, opt,
+                                          "request NBD_INFO_BLOCK_SIZE to "
+                                          "use this export");
+    }
+
     /* Final reply */
     rc = nbd_negotiate_send_rep(client->ioc, NBD_REP_ACK, opt);
     if (rc < 0) {