diff mbox

Further tidy-up on block status

Message ID 20161214150840.10899-1-alex@alex.org.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Alex Bligh Dec. 14, 2016, 3:08 p.m. UTC
(NB: I've already applied this and pushed it)

* Change NBD_OPT_LIST_METADATA etc. to explicitly send a list of queries
  and add a count of queries so we can extend the command later (rather than
  rely on the length of option)

* For NBD_OPT_LIST_METADATA make absence of any query (as opposed to zero
  length query) list all contexts, as absence of any query is now simple.

* Move definition of namespaces in the document to somewhere more appopriate.

* Various other minor changes as discussed on the mailing list

Signed-off-by: Alex Bligh <alex@alex.org.uk>
---
 doc/proto.md | 179 +++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 112 insertions(+), 67 deletions(-)

Comments

Vladimir Sementsov-Ogievskiy Dec. 14, 2016, 4:25 p.m. UTC | #1
14.12.2016 18:08, Alex Bligh wrote:
> (NB: I've already applied this and pushed it)
>
> * Change NBD_OPT_LIST_METADATA etc. to explicitly send a list of queries
>    and add a count of queries so we can extend the command later (rather than
>    rely on the length of option)
>
> * For NBD_OPT_LIST_METADATA make absence of any query (as opposed to zero
>    length query) list all contexts, as absence of any query is now simple.
>
> * Move definition of namespaces in the document to somewhere more appopriate.
>
> * Various other minor changes as discussed on the mailing list
>
> Signed-off-by: Alex Bligh <alex@alex.org.uk>
> ---
>   doc/proto.md | 179 +++++++++++++++++++++++++++++++++++++----------------------
>   1 file changed, 112 insertions(+), 67 deletions(-)
>
> diff --git a/doc/proto.md b/doc/proto.md
> index e03f434..6955c76 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -683,15 +683,20 @@ by other users, so has been moved out of the "experimental" section.
>   
>   ## Metadata querying
>   
> -With the availability of sparse storage formats, it is often needed to
> -query the status of a particular range and read only those blocks of
> -data that are actually present on the block device.
> +It is often helpful for the client to be able to query the status
> +of a range of blocks. The nature of the status that can be
> +queried is in part implementation dependent. For instance,
> +the status might represent:
>   
> -Some storage formats and operations over such formats express a
> -concept of data dirtiness. Whether the operation is block device
> -mirroring, incremental block device backup or any other operation with
> -a concept of data dirtiness, they all share a need to provide a list
> -of ranges that this particular operation treats as dirty.
> +* in a sparse storage format, whether the relevant blocks are
> +  actually present on the backing device for the export; or
> +
> +* whether the relevant blocks are 'dirty'; some storage formats
> +  and operations over such formats express a concept of data dirtiness.
> +  Whether the operation is block device mirroring, incremental block
> +  device backup or any other operation with a concept of data dirtiness,
> +  they all share a need to provide a list of ranges that this
> +  particular operation treats as dirty.
>   
>   To provide such classes of information, the NBD protocol has a generic
>   framework for querying metadata; however, its use must first be
> @@ -699,9 +704,11 @@ negotiated, and one or more metadata contexts must be selected.
>   
>   The procedure works as follows:
>   
> -- First, during negotiation, the client MUST select one or more metadata
> +- First, during negotiation, if the client wishes to query metadata
> +  during transmission, the client MUST select one or more metadata
>     contexts with the `NBD_OPT_SET_META_CONTEXT` command. If needed, the client
> -  can use `NBD_OPT_LIST_META_CONTEXT` to list contexts.
> +  can use `NBD_OPT_LIST_META_CONTEXT` to list contexts that the server
> +  supports.
>   - During transmission, a client can then indicate interest in metadata
>     for a given region by way of the `NBD_CMD_BLOCK_STATUS` command, where
>     *offset* and *length* indicate the area of interest. The server MUST
> @@ -715,13 +722,37 @@ The procedure works as follows:
>     of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
>   
>   A client MUST NOT use `NBD_CMD_BLOCK_STATUS` unless it selected a
> -nonzero number of metadata contexts during negotiation. Servers SHOULD
> -reply to clients doing so anyway with `EINVAL`.
> +non-zero number of metadata contexts during negotiation. Servers SHOULD
> +reply to clients sending `NBD_CMD_BLOCK_STATUS without

backquote

> +selecting metadata contexts with `EINVAL`.
>   
> -The reply to the `NBD_CMD_BLOCK_STATUS` request MUST be sent by a
> +The reply to the `NBD_CMD_BLOCK_STATUS` request MUST be sent as a
>   structured reply; this implies that in order to use metadata querying,
>   structured replies MUST be negotiated first.
>   
> +Metadata contexts are identified by their names. The name MUST
> +consist of a namespace, followed by a colon, followed by a leaf-name.
> +The namespace and the leaf-name must each consist entirely of
> +printable non-whitespace UTF-8 characters other than colons,
> +and be non-empty. The entire name (namespace, colon and leaf-name)
> +MUST NOT exceed 255 bytes (and therefore botht he namespace and
> +leaf-name are guaranteed to be smaller than 255 bytes).
> +
> +Namespaces MUST be consist of one of the following:
> +- `base`, for metadata contexts defined by this document;
> +- `nbd-server`, for metadata contexts defined by the
> +   implementation that accompanies this document (none
> +   currently);
> +- `x-*`, where `*` can be replaced by any random string not
> +   containing colons, for local experiments. This SHOULD NOT be
> +   used by metadata contexts that are expected to be widely used.
> +- A third-party namespace from the list below.
> +
> +Third-party implementations can register additional namespaces by simple
> +request to the mailing-list. The following additional third-party namespaces
> +are currently registered:
> +* (none)
> +
>   This standard defines exactly one metadata context; it is called
>   `base:allocation`, and it provides information on the basic allocation
>   status of extents (that is, whether they are allocated at all in a
> @@ -918,9 +949,7 @@ of the newstyle negotiation.
>   - `NBD_OPT_LIST_META_CONTEXT` (10)
>   
>       Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
> -    followed by an `NBD_REP_ACK`. If a server replies to such a request
> -    with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
> -    commands during the transmission phase.
> +    followed by an `NBD_REP_ACK`.
>   
>       If the query string is syntactically invalid, the server SHOULD send
>       `NBD_REP_ERR_INVALID`. If the query string is syntactically valid
> @@ -932,51 +961,58 @@ of the newstyle negotiation.
>       SHOULD send `NBD_REP_ERR_INVALID`.
>   
>       Data:
> -    - 32 bits, length of export name
> -    - String, name of export for which we wish to list or select metadata
> +    - 32 bits, length of export name.
> +    - String, name of export for which we wish to list metadata
>         contexts.
> -    - 32 bits, length of query
> -    - String, query to select a subset of the available metadata
> -      contexts. If this is not specified (i.e., the "length of query"
> -      field is 0 and no query is sent), then the server MUST send all
> -      the metadata contexts it knows about. If specified, this query
> -      string MUST start with a name that uniquely identifies a server
> -      implementation; e.g., the reference implementation that
> -      accompanies this document would support query strings starting
> -      with 'nbd-server:'
> -
> -    The server MUST reply with a list of `NBD_REP_META_CONTEXT` replies,
> -    followed by `NBD_REP_ACK`. The metadata context ID in these replies
> -    is reserved and SHOULD be set to zero; clients SHOULD disregard it.
> +    - 32 bits, number of queries
> +    - Zero or more queries, each being:
> +       - 32 bits, length of query.
> +       - String, query to list a subset of the available metadata
> +         contexts.
> +
> +    If zero queries are sent, then the server MUST return all
> +    the metadata contexts it knows about.

I think that 'all .. it knows about' is too much. What about 'return all 
available ..'? Anyway 'all ... it knows about' actually equals to 'all 
... it wants'. There may be some private, or unrelated contexts, for 
example..

> +
> +    For details on the query string, see under `NBD_OPT_SET_META_CONTEXT`.
> +
> +    The server MUST either reply with an error (for instance `EINVAL`
> +    if the option is not supported), or reply with a list of
> +    `NBD_REP_META_CONTEXT` replies followed by `NBD_REP_ACK`.
> +    The metadata context ID in these replies is reserved and SHOULD be
> +    set to zero; clients MUST disregard it.
>   
>   - `NBD_OPT_SET_META_CONTEXT` (11)
>   
>       Change the set of active metadata contexts. Issuing this command
>       replaces all previously-set metadata contexts; clients must ensure
> -    that all metadata contexts they're interested in are selected with
> +    that all metadata contexts they are interested in are selected with
>       the final query that they sent.
>   
> +    A client MUST NOT send `NBD_CMD_BLOCK_STATUS` unless
> +    within the negotiation phase it sent `NBD_OPT_SET_META_CONTEXT`
> +    at least once, and the final time it was sent, the server
> +    responded without an error, and returned at least one metadata
> +    context.
> +
>       Data:
> -    - 32 bits, length of query
> -    - String, query to select metadata contexts. The syntax of this
> -      query is implementation-defined, except that it MUST start with a
> -      namespace. This namespace may be one of the following:
> -        - `base:`, for metadata contexts defined by this document;
> -        - `nbd-server:`, for metadata contexts defined by the
> -          implementation that accompanies this document (none
> -          currently);
> -        - `x-*:`, where `*` can be replaced by any random string not
> -          containing colons, for local experiments. This SHOULD NOT be
> -          used by metadata contexts that are expected to be widely used.
> -        - third-party implementations can register additional
> -          namespaces by simple request to the mailinglist.
> -
> -    These two fields MAY be repeated as much as is necessary to select all
> -    metadata contexts the client is interested in.
> +    - 32 bits, length of export name.
> +    - String, name of export for which we wish to list metadata
> +      contexts.
> +    - 32 bits, number of queries
> +    - Zero or more queries, each being:
> +       - 32 bits, length of query
> +       - String, query to select metadata contexts. The syntax of this
> +         query is implementation-defined, except that it MUST start with a
> +         namespace and a colon.
> +
> +    If zero queries are sent, the server MUST select no metadata contexts.
>   
>       The server MUST reply with a number of `NBD_REP_META_CONTEXT`
>       replies, one for each selected metadata context, each with a unique
> -    metadata context ID. It is not an error if a
> +    metadata context ID, followed by `NBD_REP_ACK`. The metadata context
> +    ID is transient and may vary across calls to `NBD_OPT_SET_META_CONTEXT`;
> +    clients MUST therefore treat the ID as an opaque value and not (for
> +    instance) cache it between connections. It is not an error if a
>       `NBD_OPT_SET_META_CONTEXT` option does not select any metadata
>       context, provided the client then does not attempt to issue
>       `NBD_CMD_BLOCK_STATUS` commands.
> @@ -1071,9 +1107,10 @@ The `base:allocation` metadata context is the basic "allocated at all"
>   metadata context. If an extent is marked with `NBD_STATE_HOLE` at that
>   context, this means that the given extent is not allocated in the
>   backend storage, and that writing to the extent MAY result in the ENOSPC
> -error. This supports sparse file semantics on the server side. If a
> -server has only one metadata context (the default), then writing to an
> -extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.
> +error. This supports sparse file semantics on the server side.
> +If a server supports the `base:allocation` metadata context, then writing
> +to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC
> +unless for reasons specified in the definition of another context.
>   
>   It defines the following flags for the flags field:
>   
> @@ -1082,20 +1119,21 @@ It defines the following flags for the flags field:
>     `ENOSPC` error); if clear, the block is allocated or the server could
>     not otherwise determine its status. Note that the use of
>     `NBD_CMD_TRIM` is related to this status, but that the server MAY
> -  report a hole even where trim has not been requested, and also that a
> -  server MAY report metadata even where a trim has been requested.
> +  report a hole even where `NBD_CMD_TRIM` has not been requested, and
> +  also that a server MAY report that the block is allocated even where
> +  `NBD_CMD_TRIM` has been requested.
>   - `NBD_STATE_ZERO` (bit 1): if set, the block contents read as all
>     zeroes; if clear, the block contents are not known. Note that the use
>     of `NBD_CMD_WRITE_ZEROES` is related to this status, but that the
> -  server MAY report zeroes even where write zeroes has not been
> +  server MAY report zeroes even where `NBD_CMD_WRITE_ZEROES` has not been
>     requested, and also that a server MAY report unknown content even
> -  where write zeroes has been requested.
> +  where `NBD_CMD_WRITE_ZEROES` has been requested.
>   
>   It is not an error for a server to report that a region of the
>   export has both `NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear. The
> -contents of such an area is undefined, and may not be stable;
> -clients who are aware of the existence of such a region SHOULD NOT
> -read it.
> +contents of such an area are undefined, and a client
> +reading such an area should make no assumption as to its contents
> +or stability.
>   
>   For the `base:allocation` context, the remainder of the flags field is
>   reserved. Servers SHOULD set it to all-zero; clients MUST ignore unknown
> @@ -1150,9 +1188,10 @@ valid may depend on negotiation during the handshake phase.
>      `EOVERFLOW` error chunk, if the request length is too large.
>   - bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
>     set, the client is interested in only one extent per metadata
> -  context. If this flag is present, the server SHOULD NOT send metadata
> -  on more than one extent in the reply. Clients SHOULD NOT use this flag
> -  on multiple requests for successive regions in the export.
> +  context. If this flag is present, the server MUST NOT send metadata
> +  on more than one extent in the reply. Client implementors should note
> +  that using this flag on multiple contiguous requests is likely to be
> +  inefficient.
>   
>   ##### Structured reply flags
>   
> @@ -1225,7 +1264,7 @@ interpret the "length" bytes of payload.
>   
>       *length* MUST be 4 + (a positive integer multiple of 8).  This reply
>       represents a series of consecutive block descriptors where the sum
> -    of the lengths of the descriptors MUST not be greater than the
> +    of the length fields within the descriptors MUST not be greater than the
>       length of the original request. This chunk type MUST appear exactly
>       once per metadata ID in a structured reply.
>   
> @@ -1236,14 +1275,15 @@ interpret the "length" bytes of payload.
>       and is followed by a list of one or more descriptors, each with this
>       layout:
>   
> -        * 32 bits, length (unsigned, MUST NOT be zero)
> +        * 32 bits, length of the extent to which the status below
> +          applies (unsigned, MUST be non-zero)
>           * 32 bits, status flags
>   
>       If the client used the `NBD_CMD_FLAG_REQ_ONE` flag in the request,
>       then every reply chunk MUST NOT contain more than one descriptor.
>   
>       Even if the client did not use the `NBD_CMD_FLAG_REQ_ONE` flag in
> -    its request, the server MAY return less descriptors in the reply
> +    its request, the server MAY return fewer descriptors in the reply
>       than would be required to fully specify the whole range of requested
>       information to the client, if looking up the information would be
>       too resource-intensive for the server, so long as at least one
> @@ -1462,8 +1502,13 @@ The following request types exist:
>   * `NBD_CMD_BLOCK_STATUS` (7)
>   
>       A block status query request. Length and offset define the range of
> -    interest. Clients MUST NOT use this request unless metadata
> -    contexts have been negotiated, which in turn requires the client to
> +    interest.
> +
> +    A client MUST NOT send `NBD_CMD_BLOCK_STATUS` unless
> +    within the negotiation phase it sent `NBD_OPT_SET_META_CONTEXT`
> +    at least once, and the final time it was sent, the server
> +    responded without an error, and returned at least one metadata
> +    context. This in turn requires the client to
>       first negotiate structured replies. For a successful return, the
>       server MUST use a structured reply, containing at least one chunk of
>       type `NBD_REPLY_TYPE_BLOCK_STATUS`, where the status field of each
Alex Bligh Dec. 14, 2016, 4:38 p.m. UTC | #2
Vladimir,

>> +non-zero number of metadata contexts during negotiation. Servers SHOULD
>> +reply to clients sending `NBD_CMD_BLOCK_STATUS without
> 
> backquote

Fixed

>> +    If zero queries are sent, then the server MUST return all
>> +    the metadata contexts it knows about.
> 
> I think that 'all .. it knows about' is too much. What about 'return all available ..'? Anyway 'all ... it knows about' actually equals to 'all ... it wants'. There may be some private, or unrelated contexts, for example..

This was not my wording, but I've changed it anyway to:

    If zero queries are sent, then the server MUST return all
    the metadata contexts that are available to	the client to select
    on the given export with `NBD_OPT_SET_META_CONTEXT`.

I think if they are available to select, we should list them. Thanks
also for reminding me to document why I put the export name into the
_LIST_ data (as it is for _SET_).

However, this raises another question. Wouter deliberately made the
query format freeform so that you could e.g. set a context like:

   backup:modtime>201612081034

which might (in theory) return a list of blocks which are newer than
the given timestamp. It would clearly be impossible to return all such
contexts. I wonder if we should carve out an exception here.
Eric Blake Dec. 14, 2016, 4:58 p.m. UTC | #3
On 12/14/2016 09:08 AM, Alex Bligh wrote:
> (NB: I've already applied this and pushed it)
> 
> * Change NBD_OPT_LIST_METADATA etc. to explicitly send a list of queries
>   and add a count of queries so we can extend the command later (rather than
>   rely on the length of option)
> 
> * For NBD_OPT_LIST_METADATA make absence of any query (as opposed to zero
>   length query) list all contexts, as absence of any query is now simple.
> 
> * Move definition of namespaces in the document to somewhere more appopriate.
> 
> * Various other minor changes as discussed on the mailing list
> 

> +Metadata contexts are identified by their names. The name MUST
> +consist of a namespace, followed by a colon, followed by a leaf-name.
> +The namespace and the leaf-name must each consist entirely of
> +printable non-whitespace UTF-8 characters other than colons,
> +and be non-empty. The entire name (namespace, colon and leaf-name)
> +MUST NOT exceed 255 bytes (and therefore botht he namespace and

s/botht he/both the/
Alex Bligh Dec. 14, 2016, 5:03 p.m. UTC | #4
> On 14 Dec 2016, at 16:58, Eric Blake <eblake@redhat.com> wrote:
> 
> s/botht he/both the/

Thanks - fixed

--
Alex Bligh
Vladimir Sementsov-Ogievskiy Dec. 14, 2016, 5:05 p.m. UTC | #5
14.12.2016 19:38, Alex Bligh wrote:
> Vladimir,
>
>>> +non-zero number of metadata contexts during negotiation. Servers SHOULD
>>> +reply to clients sending `NBD_CMD_BLOCK_STATUS without
>> backquote
> Fixed
>
>>> +    If zero queries are sent, then the server MUST return all
>>> +    the metadata contexts it knows about.
>> I think that 'all .. it knows about' is too much. What about 'return all available ..'? Anyway 'all ... it knows about' actually equals to 'all ... it wants'. There may be some private, or unrelated contexts, for example..
> This was not my wording, but I've changed it anyway to:
>
>      If zero queries are sent, then the server MUST return all
>      the metadata contexts that are available to	the client to select
>      on the given export with `NBD_OPT_SET_META_CONTEXT`.
>
> I think if they are available to select, we should list them. Thanks
> also for reminding me to document why I put the export name into the
> _LIST_ data (as it is for _SET_).
>
> However, this raises another question. Wouter deliberately made the
> query format freeform so that you could e.g. set a context like:
>
>     backup:modtime>201612081034
>
> which might (in theory) return a list of blocks which are newer than
> the given timestamp. It would clearly be impossible to return all such
> contexts. I wonder if we should carve out an exception here.
>

Interesting. Even query 'backup:modtime>*' would be a problem, not only 
empty query list.

Actually, we do not need to 'list contexts', as it is about management, 
not about data transfer. We only need a way to check, that particular 
query selects all needed contexts and no others. Just to be sure that we 
are know, what we will select by NBD_OPT_SET_META_CONTEXT with _same_ query.

So, I suggest just to say, that _LIST_ can return error if too much 
contexts are selected.|And same should be done for _SET_. And it should 
be documented that this result of query (list or error) should be equal 
for these two commands.
|
Wouter Verhelst Dec. 14, 2016, 5:09 p.m. UTC | #6
On Wed, Dec 14, 2016 at 03:08:40PM +0000, Alex Bligh wrote:
> (NB: I've already applied this and pushed it)

Thanks.

> * Change NBD_OPT_LIST_METADATA etc. to explicitly send a list of queries
>   and add a count of queries so we can extend the command later (rather than
>   rely on the length of option)

Sure, that works.

> * For NBD_OPT_LIST_METADATA make absence of any query (as opposed to zero
>   length query) list all contexts, as absence of any query is now simple.
> 
> * Move definition of namespaces in the document to somewhere more appopriate.
> 
> * Various other minor changes as discussed on the mailing list

Right. I think we're getting close to a good spec now, for this thing.

One thing I've been thinking about that we might want to add:

There may be cases where a server, in performing the required calls to
be able to handle a BLOCK_STATUS request, will end up with more
information than the client asked; e.g., if the client asked information
in the base:allocation context on an extent at offset X of length Y,
then the server might conceivably do an lseek(SEEK_DATA) and/or
lseek(SEEK_HOLE). The result of that call might be that the file offset
will now point to a location Z, where Z > (X+Y).

Currently, our spec says "the sum of the *length* fields MUST not be
greater than the overall *length* of request". This means that in
essense, the server then has to throw away the information it has on the
range Z - (X + Y). In case a client was interested in that information,
that seems like a waste. I would therefore like to remove that
requirement, by rephrasing that "sum of the *length* fields" thing into
something along the following lines:

  In case the server returns N extents, the sum of the *length* fields
  of the first N-1 extents MUST NOT be greater than the overall *length*
  of the request. The final extent MAY exceed the length of the request
  if the server has that information anyway as a side effect of looking
  up the required information and wishes to share it.

This would then result in the fact that the "length" field in the
BLOCK_STATUS command would be little more than a hint, since we're
saying that a server can return more data than requested (if it's
available anyway) and less data than requested (if it would be too
resource-intensive to provide all the information). Not sure whether all
that makes much sense anymore, but hey.

In addition, the combination of a server providing more information than
requested with a "REQ_ONE" flag and a length field of zero could be an
interesting way to enumerate a whole export, too. Essentially, we could
define that as a client saying "I'm interested in what the size of the
extent at offset X is, and what its properties are".

Thoughts?
Vladimir Sementsov-Ogievskiy Dec. 14, 2016, 5:33 p.m. UTC | #7
14.12.2016 20:09, Wouter Verhelst wrote:
> On Wed, Dec 14, 2016 at 03:08:40PM +0000, Alex Bligh wrote:
>> (NB: I've already applied this and pushed it)
> Thanks.
>
>> * Change NBD_OPT_LIST_METADATA etc. to explicitly send a list of queries
>>    and add a count of queries so we can extend the command later (rather than
>>    rely on the length of option)
> Sure, that works.
>
>> * For NBD_OPT_LIST_METADATA make absence of any query (as opposed to zero
>>    length query) list all contexts, as absence of any query is now simple.
>>
>> * Move definition of namespaces in the document to somewhere more appopriate.
>>
>> * Various other minor changes as discussed on the mailing list
> Right. I think we're getting close to a good spec now, for this thing.
>
> One thing I've been thinking about that we might want to add:
>
> There may be cases where a server, in performing the required calls to
> be able to handle a BLOCK_STATUS request, will end up with more
> information than the client asked; e.g., if the client asked information
> in the base:allocation context on an extent at offset X of length Y,
> then the server might conceivably do an lseek(SEEK_DATA) and/or
> lseek(SEEK_HOLE). The result of that call might be that the file offset
> will now point to a location Z, where Z > (X+Y).
>
> Currently, our spec says "the sum of the *length* fields MUST not be
> greater than the overall *length* of request". This means that in
> essense, the server then has to throw away the information it has on the
> range Z - (X + Y). In case a client was interested in that information,
> that seems like a waste. I would therefore like to remove that
> requirement, by rephrasing that "sum of the *length* fields" thing into
> something along the following lines:
>
>    In case the server returns N extents, the sum of the *length* fields
>    of the first N-1 extents MUST NOT be greater than the overall *length*
>    of the request. The final extent MAY exceed the length of the request
>    if the server has that information anyway as a side effect of looking
>    up the required information and wishes to share it.
>
> This would then result in the fact that the "length" field in the
> BLOCK_STATUS command would be little more than a hint, since we're
> saying that a server can return more data than requested (if it's
> available anyway) and less data than requested (if it would be too
> resource-intensive to provide all the information). Not sure whether all
> that makes much sense anymore, but hey.
>
> In addition, the combination of a server providing more information than
> requested with a "REQ_ONE" flag and a length field of zero could be an
> interesting way to enumerate a whole export, too. Essentially, we could
> define that as a client saying "I'm interested in what the size of the
> extent at offset X is, and what its properties are".
>
> Thoughts?
>

Good, I'm for it. May be, in such wording there are too much information 
about implementation (again, server can do it if it _wants_, is it a 
side effect or not is implementation defined). In other words, "if the 
server..." is better read as an example, not requirement. But it's not 
important.
Alex Bligh Dec. 14, 2016, 5:36 p.m. UTC | #8
> On 14 Dec 2016, at 17:05, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
>> 
>> However, this raises another question. Wouter deliberately made the
>> query format freeform so that you could e.g. set a context like:
>> 
>>    backup:modtime>201612081034
>> 
>> which might (in theory) return a list of blocks which are newer than
>> the given timestamp. It would clearly be impossible to return all such
>> contexts. I wonder if we should carve out an exception here.
>> 
>> 
> 
> Interesting. Even query 'backup:modtime>*' would be a problem, not only empty query list.
> 
> Actually, we do not need to 'list contexts', as it is about management, not about data transfer. We only need a way to check, that particular query selects all needed contexts and no others. Just to be sure that we are know, what we will select by NBD_OPT_SET_META_CONTEXT with _same_ query.
> 
> So, I suggest just to say, that _LIST_ can return error if too much contexts are selected. And same should be done for _SET_. And it should be documented that this result of query (list or error) should be equal for these two commands.

(two CCs that always bounce removed)

Hmmm... Well in the '*' case, it's up to the namespace as to how it parses things, so '*' could be prohibited entirely or mean 'return the first 20 matching' or similar. So that's less of a problem. And 'set all' doesn't exist (unlike 'list all').

I think in the LIST case we should allow the server to omit contexts under certain circumstances, and allow SET to return ETOOBIG.
Alex Bligh Dec. 14, 2016, 5:39 p.m. UTC | #9
Wouter,

> Right. I think we're getting close to a good spec now, for this thing.
> 
> One thing I've been thinking about that we might want to add:
> 
> There may be cases where a server, in performing the required calls to
> be able to handle a BLOCK_STATUS request, will end up with more
> information than the client asked; e.g., if the client asked information
> in the base:allocation context on an extent at offset X of length Y,
> then the server might conceivably do an lseek(SEEK_DATA) and/or
> lseek(SEEK_HOLE). The result of that call might be that the file offset
> will now point to a location Z, where Z > (X+Y).
> 
> Currently, our spec says "the sum of the *length* fields MUST not be
> greater than the overall *length* of request". This means that in
> essense, the server then has to throw away the information it has on the
> range Z - (X + Y). In case a client was interested in that information,
> that seems like a waste. I would therefore like to remove that
> requirement, by rephrasing that "sum of the *length* fields" thing into
> something along the following lines:
> 
>  In case the server returns N extents, the sum of the *length* fields
>  of the first N-1 extents MUST NOT be greater than the overall *length*
>  of the request. The final extent MAY exceed the length of the request
>  if the server has that information anyway as a side effect of looking
>  up the required information and wishes to share it.
> 
> This would then result in the fact that the "length" field in the
> BLOCK_STATUS command would be little more than a hint, since we're
> saying that a server can return more data than requested (if it's
> available anyway) and less data than requested (if it would be too
> resource-intensive to provide all the information). Not sure whether all
> that makes much sense anymore, but hey.

+1

> In addition, the combination of a server providing more information than
> requested with a "REQ_ONE" flag and a length field of zero could be an
> interesting way to enumerate a whole export, too. Essentially, we could
> define that as a client saying "I'm interested in what the size of the
> extent at offset X is, and what its properties are".

Also +1
Vladimir Sementsov-Ogievskiy Dec. 14, 2016, 5:55 p.m. UTC | #10
14.12.2016 20:36, Alex Bligh wrote:
>> On 14 Dec 2016, at 17:05, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
>>> However, this raises another question. Wouter deliberately made the
>>> query format freeform so that you could e.g. set a context like:
>>>
>>>     backup:modtime>201612081034
>>>
>>> which might (in theory) return a list of blocks which are newer than
>>> the given timestamp. It would clearly be impossible to return all such
>>> contexts. I wonder if we should carve out an exception here.
>>>
>>>
>> Interesting. Even query 'backup:modtime>*' would be a problem, not only empty query list.
>>
>> Actually, we do not need to 'list contexts', as it is about management, not about data transfer. We only need a way to check, that particular query selects all needed contexts and no others. Just to be sure that we are know, what we will select by NBD_OPT_SET_META_CONTEXT with _same_ query.
>>
>> So, I suggest just to say, that _LIST_ can return error if too much contexts are selected. And same should be done for _SET_. And it should be documented that this result of query (list or error) should be equal for these two commands.
> (two CCs that always bounce removed)
>
> Hmmm... Well in the '*' case, it's up to the namespace as to how it parses things, so '*' could be prohibited entirely or mean 'return the first 20 matching' or similar. So that's less of a problem. And 'set all' doesn't exist (unlike 'list all').
>
> I think in the LIST case we should allow the server to omit contexts under certain circumstances, and allow SET to return ETOOBIG.
>

We can't prohibit '*' as query string as implementation defined. And we 
shouldn't (I think) define its meaning. Opposite, we can define, that 
query must not select more than 20 contexts, but I'm not sure that this 
is good.

So, do you mean

   list('backup:modtime>*') -> empty list
   set('backup:modtime>*') -> ETOOBIG

? For me this looks strange.
Wouter Verhelst Dec. 14, 2016, 6:13 p.m. UTC | #11
On Wed, Dec 14, 2016 at 08:55:56PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 14.12.2016 20:36, Alex Bligh wrote:
> >> On 14 Dec 2016, at 17:05, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
> >>> However, this raises another question. Wouter deliberately made the
> >>> query format freeform so that you could e.g. set a context like:
> >>>
> >>>     backup:modtime>201612081034
> >>>
> >>> which might (in theory) return a list of blocks which are newer than
> >>> the given timestamp. It would clearly be impossible to return all such
> >>> contexts. I wonder if we should carve out an exception here.
> >>>
> >>>
> >> Interesting. Even query 'backup:modtime>*' would be a problem, not only empty query list.
> >>
> >> Actually, we do not need to 'list contexts', as it is about management, not about data transfer. We only need a way to check, that particular query selects all needed contexts and no others. Just to be sure that we are know, what we will select by NBD_OPT_SET_META_CONTEXT with _same_ query.
> >>
> >> So, I suggest just to say, that _LIST_ can return error if too much contexts are selected. And same should be done for _SET_. And it should be documented that this result of query (list or error) should be equal for these two commands.
> > (two CCs that always bounce removed)
> >
> > Hmmm... Well in the '*' case, it's up to the namespace as to how it parses things, so '*' could be prohibited entirely or mean 'return the first 20 matching' or similar. So that's less of a problem. And 'set all' doesn't exist (unlike 'list all').
> >
> > I think in the LIST case we should allow the server to omit contexts under certain circumstances, and allow SET to return ETOOBIG.
> >
> 
> We can't prohibit '*' as query string as implementation defined. And we 
> shouldn't (I think) define its meaning. Opposite, we can define, that 
> query must not select more than 20 contexts, but I'm not sure that this 
> is good.

No, I don't think so, either.

I can see how "list all contexts" might be difficult to implement, in
the example of the dynamic contexts that I pointed out earlier. We could
muddle the spec further to fix that (hypothetical) problem, but I'm not
sure that's worth it.

Instead, I would suggest that the spec leave it up to the namespace spec
to define what gets listed when. The namespace should still give some
indication that a particular *type* of context is available, though;
e.g., _LIST_ could return 'backup:snapshot-diff{*}' as well as a list of
'backup:snapshot{snapshot-X}' contexts, where the latter are all
available snapshots, and the 'snapshot-diff' bit implies that the whole
cartesian product of all possible diffs between snapshots is possible in
a format of 'backup:snapshot-diff{snapshot-X:snapshot-Y}', or some such
(you get the idea).

The spec currently says that a server SHOULD allow choosing exports by
simply naming them, but doesn't make that a MUST; and it also says that
the query format is defined by the namespace. If we simply clarify that
the output of _LIST_ doesn't necessarily need to be a full list of all
that's available (that it may be symbolic in the case of namespaces that
allow things to be dynamically defined), we should be good.
Alex Bligh Dec. 14, 2016, 6:17 p.m. UTC | #12
> On 14 Dec 2016, at 17:55, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
> 
>> Hmmm... Well in the '*' case, it's up to the namespace as to how it parses things, so '*' could be prohibited entirely or mean 'return the first 20 matching' or similar. So that's less of a problem. And 'set all' doesn't exist (unlike 'list all').
>> 
>> I think in the LIST case we should allow the server to omit contexts under certain circumstances, and allow SET to return ETOOBIG.
>> 
> 
> We can't prohibit '*' as query string as implementation defined. And we shouldn't (I think) define its meaning.

Those two statements appear to me to contradict each other. The current documentation *does* define the query string (to the right of the colon) as implementation defined. I'm thus not sure what you mean.

> Opposite, we can define, that query must not select more than 20 contexts, but I'm not sure that this is good.

Each query can select from only one namespace in the current document (Wouter explained why). However, you can specify multiple queries.

I don't think we need to define a hard limit of a particular number.

> So, do you mean
> 
>  list('backup:modtime>*') -> empty list

The result of that would depend on however the 'backup' context was defined, meaning it had the options of:
* ETOOBIG
* Listing some subset of available contexts (arguably 'none' is a subset)

>  set('backup:modtime>*') -> ETOOBIG

Again, the result of that would depend on however the 'backup' context was defined, meaning it had the options of:
* ETOOBIG
* Listing some subset of available contexts (arguably 'none' is a subset)
... and it need not make the same choice as above, but I agree it would be logical for it to do so.

As any form of wildcarding within a query refers to one particular namespace (as a query by its nature specifies a single namespace), we don't have to worry about the way wildcarding works in the standard, as Wouter pointed out. We merely need to provide that ETOOBIG and listing a subset are acceptable responses.

What we do need to decide is what the response to list() (i.e. a list with a zero length list of queries) does. It's currently defined to return every context from every namespace. Options would include
* ETOOBIG
* Listing some subset of available contexts (arguably 'none' is a subset)
* Allowing abbreviation of algorithmically specified contexts (e.g. in this case just returning 'backup:' to represent all available backup contexts)
Alex Bligh Dec. 14, 2016, 6:18 p.m. UTC | #13
> On 14 Dec 2016, at 18:13, Wouter Verhelst <w@uter.be> wrote:
> 
> Instead, I would suggest that the spec leave it up to the namespace spec
> to define what gets listed when. The namespace should still give some
> indication that a particular *type* of context is available, though;
> e.g., _LIST_ could return 'backup:snapshot-diff{*}' as well as a list of
> 'backup:snapshot{snapshot-X}' contexts, where the latter are all
> available snapshots, and the 'snapshot-diff' bit implies that the whole
> cartesian product of all possible diffs between snapshots is possible in
> a format of 'backup:snapshot-diff{snapshot-X:snapshot-Y}', or some such
> (you get the idea).
> 
> The spec currently says that a server SHOULD allow choosing exports by
> simply naming them, but doesn't make that a MUST; and it also says that
> the query format is defined by the namespace. If we simply clarify that
> the output of _LIST_ doesn't necessarily need to be a full list of all
> that's available (that it may be symbolic in the case of namespaces that
> allow things to be dynamically defined), we should be good.

Let me have a go at a change.

However, note that Vladimir was answering a slightly different question.
I was asking about listing ALL contexts (previously an empty query
string, now a zero length list of queries), not a 'backup:*' type
thing.
Wouter Verhelst Dec. 14, 2016, 6:49 p.m. UTC | #14
On Wed, Dec 14, 2016 at 06:18:58PM +0000, Alex Bligh wrote:
> 
> > On 14 Dec 2016, at 18:13, Wouter Verhelst <w@uter.be> wrote:
> > 
> > Instead, I would suggest that the spec leave it up to the namespace spec
> > to define what gets listed when. The namespace should still give some
> > indication that a particular *type* of context is available, though;
> > e.g., _LIST_ could return 'backup:snapshot-diff{*}' as well as a list of
> > 'backup:snapshot{snapshot-X}' contexts, where the latter are all
> > available snapshots, and the 'snapshot-diff' bit implies that the whole
> > cartesian product of all possible diffs between snapshots is possible in
> > a format of 'backup:snapshot-diff{snapshot-X:snapshot-Y}', or some such
> > (you get the idea).
> > 
> > The spec currently says that a server SHOULD allow choosing exports by
> > simply naming them, but doesn't make that a MUST; and it also says that
> > the query format is defined by the namespace. If we simply clarify that
> > the output of _LIST_ doesn't necessarily need to be a full list of all
> > that's available (that it may be symbolic in the case of namespaces that
> > allow things to be dynamically defined), we should be good.
> 
> Let me have a go at a change.
> 
> However, note that Vladimir was answering a slightly different question.

Actually not :)

> I was asking about listing ALL contexts (previously an empty query
> string, now a zero length list of queries), not a 'backup:*' type
> thing.

What I was trying to say is that I think the result to _LIST_ with no
queries should return all information the client needs to theoretically
build the list of all possible contexts, even if that list may be
so large as to be unfeasible for it to be built (e.g., in case of a
cartesian product between all possible other contexts). I gave one
example, but there may be more.

My point is that if the query includes a namespace, the result should
not be defined by our spec. If the query does not include a namespace,
the result should be "complete" by whatever definition, but not
unreasonable (i.e., don't just write a cartesian product to a client).

This could allow an interactive client to present a user with a list of
possible contexts before performing analysis on the block device, say.
Alex Bligh Dec. 14, 2016, 6:51 p.m. UTC | #15
> On 14 Dec 2016, at 18:18, Alex Bligh <alex@alex.org.uk> wrote:
> 
> Let me have a go at a change.

OK I've pushed a change. I reordered a few bits (to put the
base:allocation next to the stuff that talks about metadata
queries at the start), but the main change is below.
Alex Bligh Dec. 14, 2016, 7:01 p.m. UTC | #16
Wouter,

(Our mails crossed and I've actually pushed something, but no matter)

> On 14 Dec 2016, at 18:49, Wouter Verhelst <w@uter.be> wrote:
> 
> What I was trying to say is that I think the result to _LIST_ with no
> queries should return all information the client needs to theoretically
> build the list of all possible contexts, even if that list may be
> so large as to be unfeasible for it to be built (e.g., in case of a
> cartesian product between all possible other contexts). I gave one
> example, but there may be more.
> 
> My point is that if the query includes a namespace, the result should
> not be defined by our spec. If the query does not include a namespace,
> the result should be "complete" by whatever definition, but not
> unreasonable (i.e., don't just write a cartesian product to a client).
> 
> This could allow an interactive client to present a user with a list of
> possible contexts before performing analysis on the block device, say.

OK, so first of all, one of the changes I made earlier was that now
each of the commands carries a list of queries, the way you list
everything is not 'having a query that doesn't contain a namespace'
but rather doing a _LIST_ with no queries at all. But that's semantics
and orthogonal to the main point.

What I've proposed (and pushed - but feel free to alter it) is that

1. on _LIST_, the server can return fewer contexts than are available
   if returning all of them would consume undue levels of resources.

2. on _LIST_ where the contexts are 'algorithmic', the server can
   return e.g. 'X-Backup:' rather than 'X-Backup:modified>' and
   every integer.

3. On _SET_ if too many contexts are requested, the server may return
   an error (I think we need this anyway).

That nearly does what you ask for, but I'm not sure how you any query
could 'return all the information the client needs to build
the list of all possible contexts'. For instance, in my backup
example 'X-Backup:modified>[integer]' doesn't itself tell you
anything, as you don't know whether the integer is a unix
date time, in seconds after the epoch, milliseconds or whatever.
What, surely, as a client you want to know is 'does it support
the X-Backup: extension because I've read the spec for that and
know that it has X-Backup:modified if so'. So I've suggested it
return 'X-Backup:' only in that case, in which case from that
(*and the spec*) you know how to build any query.
Wouter Verhelst Dec. 14, 2016, 8:10 p.m. UTC | #17
On Wed, Dec 14, 2016 at 07:01:15PM +0000, Alex Bligh wrote:
> Wouter,
> 
> (Our mails crossed and I've actually pushed something, but no matter)
> 
> > On 14 Dec 2016, at 18:49, Wouter Verhelst <w@uter.be> wrote:
> > 
> > What I was trying to say is that I think the result to _LIST_ with no
> > queries should return all information the client needs to theoretically
> > build the list of all possible contexts, even if that list may be
> > so large as to be unfeasible for it to be built (e.g., in case of a
> > cartesian product between all possible other contexts). I gave one
> > example, but there may be more.
> > 
> > My point is that if the query includes a namespace, the result should
> > not be defined by our spec. If the query does not include a namespace,
> > the result should be "complete" by whatever definition, but not
> > unreasonable (i.e., don't just write a cartesian product to a client).
> > 
> > This could allow an interactive client to present a user with a list of
> > possible contexts before performing analysis on the block device, say.
> 
> OK, so first of all, one of the changes I made earlier was that now
> each of the commands carries a list of queries, the way you list
> everything is not 'having a query that doesn't contain a namespace'
> but rather doing a _LIST_ with no queries at all. But that's semantics
> and orthogonal to the main point.
> 
> What I've proposed (and pushed - but feel free to alter it) is that
> 
> 1. on _LIST_, the server can return fewer contexts than are available
>    if returning all of them would consume undue levels of resources.
> 
> 2. on _LIST_ where the contexts are 'algorithmic', the server can
>    return e.g. 'X-Backup:' rather than 'X-Backup:modified>' and
>    every integer.
> 
> 3. On _SET_ if too many contexts are requested, the server may return
>    an error (I think we need this anyway).
> 
> That nearly does what you ask for, but I'm not sure how you any query
> could 'return all the information the client needs to build
> the list of all possible contexts'. For instance, in my backup
> example 'X-Backup:modified>[integer]' doesn't itself tell you
> anything, as you don't know whether the integer is a unix
> date time, in seconds after the epoch, milliseconds or whatever.
> What, surely, as a client you want to know is 'does it support
> the X-Backup: extension because I've read the spec for that and
> know that it has X-Backup:modified if so'. So I've suggested it
> return 'X-Backup:' only in that case, in which case from that
> (*and the spec*) you know how to build any query.

Actually, it does do what I ask for :-)

A client which knows about spec X, Y, or Z, should get all the
information from a _LIST_ command that it needs in order to know whether
or not it can request a particular context or not. E.g., in the case of
a "modified more recent than X seconds ago", the _LIST_ command should
expose (somehow, defined by the spec of that context) that it supports
that particular syntax, without requiring a list of all possible
integers between 0 and 2^64 (for obvious reasons).

This does require that the client know about a particular spec before it
can query it, but that's going to be the case anyway -- a client should
have no business asking for a context of which it has no implementation,
since that would mean it is asking the server to send it information
that it is then going to throw away anyway, which makes no sense.

Additionally, this also allows a client which sees that no contexts were
returned for one of the queries in its _SET_ command to differentiate
beween "server does not support metadata context XYZ" and "server does
support metadata context XYZ, but there is no relevant metadata context
for the query I sent". Allowing a server to limit the information it
sends at an arbitrary cut-off point of "20" or whatever would *not* do
that, so I'm against it in the strongest of terms.

Put otherwise: the information sent in return of a _LIST_ command with
no query string (i.e., the command that asks for "everything") should be
"complete" in the sense that it should allow a client to know whether a
server has support for metadata context X, Y, or Z, even if that means
it may have to issue further _LIST_ queries later on, or if it may have
to combine some other information first.

In the absense of dynamic namespaces, that is most easily implemented by
just listing all metadata contexts in all namespaces that we know about,
but it doesn't *have* to be that.
Wouter Verhelst Dec. 14, 2016, 8:18 p.m. UTC | #18
On Wed, Dec 14, 2016 at 06:51:48PM +0000, Alex Bligh wrote:
> 
> > On 14 Dec 2016, at 18:18, Alex Bligh <alex@alex.org.uk> wrote:
> > 
> > Let me have a go at a change.
> 
> OK I've pushed a change. I reordered a few bits (to put the
> base:allocation next to the stuff that talks about metadata
> queries at the start), but the main change is below.
> 
> -- 
> Alex Bligh
> 
> 
> 
> @@ -970,15 +1029,38 @@ of the newstyle negotiation.
>         - String, query to list a subset of the available metadata
>           contexts.
> 
> -    If zero queries are sent, then the server MUST return all
> -    the metadata contexts that are available to the client to select
> -    on the given export with `NBD_OPT_SET_META_CONTEXT`.
> -
>      For details on the query string, see under `NBD_OPT_SET_META_CONTEXT`.
> 
>      The server MUST either reply with an error (for instance `EINVAL`
>      if the option is not supported), or reply with a list of
>      `NBD_REP_META_CONTEXT` replies followed by `NBD_REP_ACK`.
> +
> +    If zero queries are sent, then the server MUST return all
> +    the metadata contexts that are available to the client to select
> +    on the given export with `NBD_OPT_SET_META_CONTEXT`, save that:
> +
> +    If one or more queries are sent, then the server MUST return
> +    those metadata contexts that are available to the client to
> +    select on the given export with `NBD_OPT_SET_META_CONTEXT`,
> +    and which match one or more of the queries given. The
> +    support of wildcarding within the leaf-name portion of
> +    the query string is dependent upon the namespace.
> +
> +    In either case, however:
> +
> +    * the server MAY return an incomplete list if returning a
> +      complete list would require undue resources.

This part is a bad idea, as per my other mail. Returning a complete list
should not be optional, but the definition of "complete" doesn't have to
imply "list all possible context names".

> +    * the server MAY return a context consisting of a namespace and
> +      a colon only (i.e. omitting the leaf-name) to indicate that
> +      the namespace contains a large number of possible contexts
> +      within that namespace (for instance a namespace `X-backup` with
> +      contexts that indicate whether blocks were written after
> +      a given date might accept queries of the form
> +      `'X-backup:modifiedtime>[unixdate]'` where `[unixdate]` is an
> +      arbitrary integer, and in this case it might simply
> +      return `X-backup:`)

This is way too detailed, I think. It should just allow namespaces to
define what _LIST_ may return, as long as the client is somehow able to
distill (through knowledge of the spec as well as the information sent
in reply to the _LIST_ command) all the metadata contexts it can
possibly select.

>      The metadata context ID in these replies is reserved and SHOULD be
>      set to zero; clients MUST disregard it.
> 
> @@ -1009,7 +1091,9 @@ of the newstyle negotiation.
> 
>      If zero queries are sent, the server MUST select no metadata contexts.
> 
> -    The server MUST reply with a number of `NBD_REP_META_CONTEXT`
> +    The server MAY return `NBD_REP_ERR_TOO_BIG` if a request
> +    seeks to select too many contexts. Otherwise
> +    the server MUST reply with a number of `NBD_REP_META_CONTEXT`

This part makes sense.

>      replies, one for each selected metadata context, each with a unique
>      metadata context ID, followed by `NBD_REP_ACK`. The metadata context
>      ID is transient and may vary across calls to `NBD_OPT_SET_META_CONTEXT`;
John Snow Dec. 14, 2016, 8:47 p.m. UTC | #19
On 12/14/2016 12:09 PM, Wouter Verhelst wrote:
> On Wed, Dec 14, 2016 at 03:08:40PM +0000, Alex Bligh wrote:
>> (NB: I've already applied this and pushed it)
> 
> Thanks.
> 
>> * Change NBD_OPT_LIST_METADATA etc. to explicitly send a list of queries
>>   and add a count of queries so we can extend the command later (rather than
>>   rely on the length of option)
> 
> Sure, that works.
> 
>> * For NBD_OPT_LIST_METADATA make absence of any query (as opposed to zero
>>   length query) list all contexts, as absence of any query is now simple.
>>
>> * Move definition of namespaces in the document to somewhere more appopriate.
>>
>> * Various other minor changes as discussed on the mailing list
> 
> Right. I think we're getting close to a good spec now, for this thing.
> 
> One thing I've been thinking about that we might want to add:
> 
> There may be cases where a server, in performing the required calls to
> be able to handle a BLOCK_STATUS request, will end up with more
> information than the client asked; e.g., if the client asked information
> in the base:allocation context on an extent at offset X of length Y,
> then the server might conceivably do an lseek(SEEK_DATA) and/or
> lseek(SEEK_HOLE). The result of that call might be that the file offset
> will now point to a location Z, where Z > (X+Y).
> 
> Currently, our spec says "the sum of the *length* fields MUST not be
> greater than the overall *length* of request". This means that in
> essense, the server then has to throw away the information it has on the
> range Z - (X + Y). In case a client was interested in that information,
> that seems like a waste. I would therefore like to remove that
> requirement, by rephrasing that "sum of the *length* fields" thing into
> something along the following lines:
> 
>   In case the server returns N extents, the sum of the *length* fields
>   of the first N-1 extents MUST NOT be greater than the overall *length*
>   of the request. The final extent MAY exceed the length of the request
>   if the server has that information anyway as a side effect of looking
>   up the required information and wishes to share it.
> 
> This would then result in the fact that the "length" field in the
> BLOCK_STATUS command would be little more than a hint, since we're
> saying that a server can return more data than requested (if it's
> available anyway) and less data than requested (if it would be too
> resource-intensive to provide all the information). Not sure whether all
> that makes much sense anymore, but hey.
> 
> In addition, the combination of a server providing more information than
> requested with a "REQ_ONE" flag and a length field of zero could be an
> interesting way to enumerate a whole export, too. Essentially, we could
> define that as a client saying "I'm interested in what the size of the
> extent at offset X is, and what its properties are".
> 
> Thoughts?
> 

No complaints here. Even though it becomes a bit of a hint, it still
establishes some limitations. It's good enough, I think.

--js
Alex Bligh Dec. 15, 2016, 10:04 a.m. UTC | #20
> On 14 Dec 2016, at 20:18, Wouter Verhelst <w@uter.be> wrote:
> 
>> +    * the server MAY return a context consisting of a namespace and
>> +      a colon only (i.e. omitting the leaf-name) to indicate that
>> +      the namespace contains a large number of possible contexts
>> +      within that namespace (for instance a namespace `X-backup` with
>> +      contexts that indicate whether blocks were written after
>> +      a given date might accept queries of the form
>> +      `'X-backup:modifiedtime>[unixdate]'` where `[unixdate]` is an
>> +      arbitrary integer, and in this case it might simply
>> +      return `X-backup:`)
> 
> This is way too detailed, I think. It should just allow namespaces to
> define what _LIST_ may return, as long as the client is somehow able to
> distill (through knowledge of the spec as well as the information sent
> in reply to the _LIST_ command) all the metadata contexts it can
> possibly select.


OK, this part now reads:

    The server MUST either reply with an error (for instance `EINVAL`
    if the option is not supported), or reply with a list of
    `NBD_REP_META_CONTEXT` replies followed by `NBD_REP_ACK`.

    If zero queries are sent, then the server MUST return all
    the metadata contexts that are available to the client to select
    on the given export with `NBD_OPT_SET_META_CONTEXT`, save that:

    If one or more queries are sent, then the server MUST return
    those metadata contexts that are available to the client to
    select on the given export with `NBD_OPT_SET_META_CONTEXT`,
    and which match one or more of the queries given. The
    support of wildcarding within the leaf-name portion of
    the query string is dependent upon the namespace.

    In either case, however, for any given namespace the
    server MAY, instead of exhaustively listing every
    matching context available to select (or every context
    available to select where no query is given), send
    sufficient context records back to allow a client with
    knowledge of the namespace to select any context. Each
    namespace returned MUST still satisfy the rules for
    namespaces (i.e. they must begin with the relevant
    namespace, followed	by a colon, then printable non-whitespace
    UTF-8 characters, with the entire string not exceeding
    255	bytes). This may be helpful where a client can
    construct algorithmic queries. For instance, a client might
    reply simply with the namespace with no leaf-name (e.g.
    'X-FooBar:') or with a range of values (e.g.
    'X-ModifiedDate:20160310-20161214'). The semantics of
    such a reply are a matter for the definition of the
    namespace.

Hope that works.
Wouter Verhelst Dec. 15, 2016, 3:03 p.m. UTC | #21
Hi Alex,

On Thu, Dec 15, 2016 at 10:04:05AM +0000, Alex Bligh wrote:
> 
> > On 14 Dec 2016, at 20:18, Wouter Verhelst <w@uter.be> wrote:
> > 
> >> +    * the server MAY return a context consisting of a namespace and
> >> +      a colon only (i.e. omitting the leaf-name) to indicate that
> >> +      the namespace contains a large number of possible contexts
> >> +      within that namespace (for instance a namespace `X-backup` with
> >> +      contexts that indicate whether blocks were written after
> >> +      a given date might accept queries of the form
> >> +      `'X-backup:modifiedtime>[unixdate]'` where `[unixdate]` is an
> >> +      arbitrary integer, and in this case it might simply
> >> +      return `X-backup:`)
> > 
> > This is way too detailed, I think. It should just allow namespaces to
> > define what _LIST_ may return, as long as the client is somehow able to
> > distill (through knowledge of the spec as well as the information sent
> > in reply to the _LIST_ command) all the metadata contexts it can
> > possibly select.
> 
> 
> OK, this part now reads:
> 
>     The server MUST either reply with an error (for instance `EINVAL`
>     if the option is not supported), or reply with a list of
>     `NBD_REP_META_CONTEXT` replies followed by `NBD_REP_ACK`.
> 
>     If zero queries are sent, then the server MUST return all
>     the metadata contexts that are available to the client to select
>     on the given export with `NBD_OPT_SET_META_CONTEXT`, save that:

This reads a bit awkward. I would do:

s/save that:/except as explained below/

>     If one or more queries are sent, then the server MUST return
>     those metadata contexts that are available to the client to
>     select on the given export with `NBD_OPT_SET_META_CONTEXT`,
>     and which match one or more of the queries given. The
>     support of wildcarding within the leaf-name portion of
>     the query string is dependent upon the namespace.
> 
>     In either case, however, for any given namespace the
>     server MAY, instead of exhaustively listing every
>     matching context available to select (or every context
>     available to select where no query is given), send
>     sufficient context records back to allow a client with
>     knowledge of the namespace to select any context. Each
>     namespace returned MUST still satisfy the rules for
>     namespaces (i.e. they must begin with the relevant
>     namespace, followed by a colon, then printable non-whitespace
>     UTF-8 characters,

Why restrict to non-whitespace characters? (printable makes sense...)

>     with the entire string not exceeding
>     255 bytes). This may be helpful where a client can
>     construct algorithmic queries. For instance, a client might
>     reply simply with the namespace with no leaf-name (e.g.
>     'X-FooBar:') or with a range of values (e.g.
>     'X-ModifiedDate:20160310-20161214'). The semantics of
>     such a reply are a matter for the definition of the
>     namespace.
> 
> Hope that works.
Alex Bligh Dec. 15, 2016, 4:32 p.m. UTC | #22
Wouter,

> This reads a bit awkward. I would do:
> 
> s/save that:/except as explained below/

Possibly a British English thing. Will fix.

>>    If one or more queries are sent, then the server MUST return
>>    those metadata contexts that are available to the client to
>>    select on the given export with `NBD_OPT_SET_META_CONTEXT`,
>>    and which match one or more of the queries given. The
>>    support of wildcarding within the leaf-name portion of
>>    the query string is dependent upon the namespace.
>> 
>>    In either case, however, for any given namespace the
>>    server MAY, instead of exhaustively listing every
>>    matching context available to select (or every context
>>    available to select where no query is given), send
>>    sufficient context records back to allow a client with
>>    knowledge of the namespace to select any context. Each
>>    namespace returned MUST still satisfy the rules for
>>    namespaces (i.e. they must begin with the relevant
>>    namespace, followed by a colon, then printable non-whitespace
>>    UTF-8 characters,
> 
> Why restrict to non-whitespace characters? (printable makes sense...)

Because the namespaces and leaf-names are already restricted to
non-whitespace characters. I thought having tabs, line feeds,
returns, em-space, en-space etc. was not particularly useful.
I could be persuaded to relent re spaces.
Wouter Verhelst Dec. 15, 2016, 4:49 p.m. UTC | #23
On Thu, Dec 15, 2016 at 04:32:23PM +0000, Alex Bligh wrote:
> Wouter,
> 
> > This reads a bit awkward. I would do:
> > 
> > s/save that:/except as explained below/
> 
> Possibly a British English thing. Will fix.

It's mostly that the "save that:" suggests to me that the exception
follows immediately (i.e., in the next paragraph). As you wrote it, it
doesn't; it shows up in the paragraph after that, with a "however"
clause. This was very confusing; I had to read it several times before I
understood what you wanted to say.

> >>    If one or more queries are sent, then the server MUST return
> >>    those metadata contexts that are available to the client to
> >>    select on the given export with `NBD_OPT_SET_META_CONTEXT`,
> >>    and which match one or more of the queries given. The
> >>    support of wildcarding within the leaf-name portion of
> >>    the query string is dependent upon the namespace.
> >> 
> >>    In either case, however, for any given namespace the
> >>    server MAY, instead of exhaustively listing every
> >>    matching context available to select (or every context
> >>    available to select where no query is given), send
> >>    sufficient context records back to allow a client with
> >>    knowledge of the namespace to select any context. Each
> >>    namespace returned MUST still satisfy the rules for
> >>    namespaces (i.e. they must begin with the relevant
> >>    namespace, followed by a colon, then printable non-whitespace
> >>    UTF-8 characters,
> > 
> > Why restrict to non-whitespace characters? (printable makes sense...)
> 
> Because the namespaces and leaf-names are already restricted to
> non-whitespace characters. I thought having tabs, line feeds,
> returns, em-space, en-space etc. was not particularly useful.
> I could be persuaded to relent re spaces.

I could imagine that the context might include as part of its name a
user-defined bit. If we're going to disallow whitespace, then that would
mean namespaces would have to do all sorts of escaping etc. I don't
think that's a good idea.
Alex Bligh Dec. 15, 2016, 5:34 p.m. UTC | #24
> On 15 Dec 2016, at 16:49, Wouter Verhelst <w@uter.be> wrote:
> 
>> Because the namespaces and leaf-names are already restricted to
>> non-whitespace characters. I thought having tabs, line feeds,
>> returns, em-space, en-space etc. was not particularly useful.
>> I could be persuaded to relent re spaces.
> 
> I could imagine that the context might include as part of its name a
> user-defined bit. If we're going to disallow whitespace, then that would
> mean namespaces would have to do all sorts of escaping etc. I don't
> think that's a good idea.

So to be clear do you want to include all whitespace
everywhere? Or just to the right of the colon in queries?
Wouter Verhelst Dec. 16, 2016, 3:52 p.m. UTC | #25
On Thu, Dec 15, 2016 at 05:34:48PM +0000, Alex Bligh wrote:
> 
> > On 15 Dec 2016, at 16:49, Wouter Verhelst <w@uter.be> wrote:
> > 
> >> Because the namespaces and leaf-names are already restricted to
> >> non-whitespace characters. I thought having tabs, line feeds,
> >> returns, em-space, en-space etc. was not particularly useful.
> >> I could be persuaded to relent re spaces.
> > 
> > I could imagine that the context might include as part of its name a
> > user-defined bit. If we're going to disallow whitespace, then that would
> > mean namespaces would have to do all sorts of escaping etc. I don't
> > think that's a good idea.
> 
> So to be clear do you want to include all whitespace
> everywhere? Or just to the right of the colon in queries?

Just in queries. I agree that in namespace names, it doesn't make much
sense.
Alex Bligh Dec. 16, 2016, 4:25 p.m. UTC | #26
> On 16 Dec 2016, at 15:52, Wouter Verhelst <w@uter.be> wrote:
> 
> On Thu, Dec 15, 2016 at 05:34:48PM +0000, Alex Bligh wrote:
>> 
>>> On 15 Dec 2016, at 16:49, Wouter Verhelst <w@uter.be> wrote:
>>> 
>>>> Because the namespaces and leaf-names are already restricted to
>>>> non-whitespace characters. I thought having tabs, line feeds,
>>>> returns, em-space, en-space etc. was not particularly useful.
>>>> I could be persuaded to relent re spaces.
>>> 
>>> I could imagine that the context might include as part of its name a
>>> user-defined bit. If we're going to disallow whitespace, then that would
>>> mean namespaces would have to do all sorts of escaping etc. I don't
>>> think that's a good idea.
>> 
>> So to be clear do you want to include all whitespace
>> everywhere? Or just to the right of the colon in queries?
> 
> Just in queries. I agree that in namespace names, it doesn't make much
> sense.

I've changed it so the returns from queries can now have spaces to
the right of the colon if they aren't returning complete context names.
Obviously if they are returning context names, by the definition of
leaf-names they can't have spaces in the leaf-name.

The queries themselves can consist of anything (currently),
save that they must begin with a namespace and a colon.
Wouter Verhelst Dec. 17, 2016, 8:34 a.m. UTC | #27
On Fri, Dec 16, 2016 at 04:25:27PM +0000, Alex Bligh wrote:
> 
> > On 16 Dec 2016, at 15:52, Wouter Verhelst <w@uter.be> wrote:
> > 
> > On Thu, Dec 15, 2016 at 05:34:48PM +0000, Alex Bligh wrote:
> >> 
> >>> On 15 Dec 2016, at 16:49, Wouter Verhelst <w@uter.be> wrote:
> >>> 
> >>>> Because the namespaces and leaf-names are already restricted to
> >>>> non-whitespace characters. I thought having tabs, line feeds,
> >>>> returns, em-space, en-space etc. was not particularly useful.
> >>>> I could be persuaded to relent re spaces.
> >>> 
> >>> I could imagine that the context might include as part of its name a
> >>> user-defined bit. If we're going to disallow whitespace, then that would
> >>> mean namespaces would have to do all sorts of escaping etc. I don't
> >>> think that's a good idea.
> >> 
> >> So to be clear do you want to include all whitespace
> >> everywhere? Or just to the right of the colon in queries?
> > 
> > Just in queries. I agree that in namespace names, it doesn't make much
> > sense.
> 
> I've changed it so the returns from queries can now have spaces to
> the right of the colon if they aren't returning complete context names.
> Obviously if they are returning context names, by the definition of
> leaf-names they can't have spaces in the leaf-name.

Sorry, that still doesn't alleviate my concerns.

Why are we restricting what metadata contexts can use for their names?
We don't restrict what export names can use either, and I could imagine
a metadata context wanting to use an export name as a metadata context
name (e.g., "provide a bitmap on the diff between this export and the
export defined by name XYZ").

As soon as we get to the right of the colon, what sits there is defined
by the namespace, not by us. It doesn't feel right to add too many
restrictions.

I think the whole query string should be a "string" as laid out earlier
in the document. Here, it actually *does* make sense for a string to be
very long, because it's not meant to be human-readable.

I've therefore removed that restriction as well as the "255 bytes max"
one that you added, since I don't think they make much sense. That
doesn't mean I can't be convinced otherwise by good arguments, but
they'd have to be very good ones ;-)

> The queries themselves can consist of anything (currently),
> save that they must begin with a namespace and a colon.

Sure.
Alex Bligh Dec. 17, 2016, 9:41 a.m. UTC | #28
> On 17 Dec 2016, at 08:34, Wouter Verhelst <w@uter.be> wrote:
> 
> I've therefore removed that restriction as well as the "255 bytes max"
> one that you added, since I don't think they make much sense. That
> doesn't mean I can't be convinced otherwise by good arguments, but
> they'd have to be very good ones ;-)

But we were at cross purposes. I was talking about the return from
queries, and you've adjusted namespaces (in toto). I've fixed up
your edit slightly, and brought the return for queries in line
with it. I hope this nit is now dead!
Vladimir Sementsov-Ogievskiy Dec. 26, 2016, 2:52 p.m. UTC | #29
14.12.2016 21:17, Alex Bligh wrote:
>> On 14 Dec 2016, at 17:55, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
>>
>>> Hmmm... Well in the '*' case, it's up to the namespace as to how it parses things, so '*' could be prohibited entirely or mean 'return the first 20 matching' or similar. So that's less of a problem. And 'set all' doesn't exist (unlike 'list all').
>>>
>>> I think in the LIST case we should allow the server to omit contexts under certain circumstances, and allow SET to return ETOOBIG.
>>>
>> We can't prohibit '*' as query string as implementation defined. And we shouldn't (I think) define its meaning.
> Those two statements appear to me to contradict each other. The current documentation *does* define the query string (to the right of the colon) as implementation defined. I'm thus not sure what you mean.
>
>> Opposite, we can define, that query must not select more than 20 contexts, but I'm not sure that this is good.
> Each query can select from only one namespace in the current document (Wouter explained why). However, you can specify multiple queries.
>
> I don't think we need to define a hard limit of a particular number.
>
>> So, do you mean
>>
>>   list('backup:modtime>*') -> empty list
> The result of that would depend on however the 'backup' context was defined, meaning it had the options of:
> * ETOOBIG
> * Listing some subset of available contexts (arguably 'none' is a subset)
>
>>   set('backup:modtime>*') -> ETOOBIG
> Again, the result of that would depend on however the 'backup' context was defined, meaning it had the options of:
> * ETOOBIG
> * Listing some subset of available contexts (arguably 'none' is a subset)
> ... and it need not make the same choice as above, but I agree it would be logical for it to do so.
>
> As any form of wildcarding within a query refers to one particular namespace (as a query by its nature specifies a single namespace), we don't have to worry about the way wildcarding works in the standard, as Wouter pointed out. We merely need to provide that ETOOBIG and listing a subset are acceptable responses.
>
> What we do need to decide is what the response to list() (i.e. a list with a zero length list of queries) does. It's currently defined to return every context from every namespace. Options would include
> * ETOOBIG
> * Listing some subset of available contexts (arguably 'none' is a subset)
> * Allowing abbreviation of algorithmically specified contexts (e.g. in this case just returning 'backup:' to represent all available backup contexts)

Shouldn't we add some flags to REP_META_CONTEXT, for client to be 
insure, is returned string a direct context name or some kind of 
wildcard?|| Just a flags field, with one flag defined for now: 
|NBD_REP_META_CONTEXT_LEAF and others reserved.|
Vladimir Sementsov-Ogievskiy Dec. 26, 2016, 3:57 p.m. UTC | #30
14.12.2016 22:01, Alex Bligh wrote:
> Wouter,
>
> (Our mails crossed and I've actually pushed something, but no matter)
>
>> On 14 Dec 2016, at 18:49, Wouter Verhelst <w@uter.be> wrote:
>>
>> What I was trying to say is that I think the result to _LIST_ with no
>> queries should return all information the client needs to theoretically
>> build the list of all possible contexts, even if that list may be
>> so large as to be unfeasible for it to be built (e.g., in case of a
>> cartesian product between all possible other contexts). I gave one
>> example, but there may be more.
>>
>> My point is that if the query includes a namespace, the result should
>> not be defined by our spec. If the query does not include a namespace,
>> the result should be "complete" by whatever definition, but not
>> unreasonable (i.e., don't just write a cartesian product to a client).
>>
>> This could allow an interactive client to present a user with a list of
>> possible contexts before performing analysis on the block device, say.
> OK, so first of all, one of the changes I made earlier was that now
> each of the commands carries a list of queries, the way you list
> everything is not 'having a query that doesn't contain a namespace'
> but rather doing a _LIST_ with no queries at all. But that's semantics
> and orthogonal to the main point.
>
> What I've proposed (and pushed - but feel free to alter it) is that

I think, an overview in "## Metadata querying" should be updated too.

>
> 1. on _LIST_, the server can return fewer contexts than are available
>     if returning all of them would consume undue levels of resources.
>
> 2. on _LIST_ where the contexts are 'algorithmic', the server can
>     return e.g. 'X-Backup:' rather than 'X-Backup:modified>' and
>     every integer.
>
> 3. On _SET_ if too many contexts are requested, the server may return
>     an error (I think we need this anyway).
>
> That nearly does what you ask for, but I'm not sure how you any query
> could 'return all the information the client needs to build
> the list of all possible contexts'. For instance, in my backup
> example 'X-Backup:modified>[integer]' doesn't itself tell you
> anything, as you don't know whether the integer is a unix
> date time, in seconds after the epoch, milliseconds or whatever.
> What, surely, as a client you want to know is 'does it support
> the X-Backup: extension because I've read the spec for that and
> know that it has X-Backup:modified if so'. So I've suggested it
> return 'X-Backup:' only in that case, in which case from that
> (*and the spec*) you know how to build any query.
>
Vladimir Sementsov-Ogievskiy Dec. 27, 2016, 2:09 p.m. UTC | #31
A bit out of topic, but...

> structured replies via `NBD_OPT_STRUCTURED_REPLY`.  Conversely, if
> structured replies are negotiated, the server MUST use a
> structured reply for any response with a payload, and MUST NOT use
> a simple reply for `NBD_CMD_READ` (even for the case of an early
> `EINVAL` due to bad flags), but MAY use either a simple reply or a
> structured reply to all other requests.

What was the reason for it? Why not to negotiate forced structured read 
separately? Actually, this spec forces any server, which wants to 
implement structured reply implement structured read too. But what if it 
don't want to? If it only wants to implement BLOCK_STATUS?

So, what about changing it, to allow BLOCK_STATUS (or other future 
structured replies) without structured read? Structured read is good 
only for sparse formats, when BLOCK_STATUS is more global. I understand, 
that servers may implement simple (and useless) one-chunk structured 
read, but I think that it is better to fix the spec, to not provoke 
servers use such workaround.
Eric Blake Dec. 27, 2016, 3:56 p.m. UTC | #32
On 12/27/2016 08:09 AM, Vladimir Sementsov-Ogievskiy wrote:
> A bit out of topic, but...
> 
>> structured replies via `NBD_OPT_STRUCTURED_REPLY`.  Conversely, if
>> structured replies are negotiated, the server MUST use a
>> structured reply for any response with a payload, and MUST NOT use
>> a simple reply for `NBD_CMD_READ` (even for the case of an early
>> `EINVAL` due to bad flags), but MAY use either a simple reply or a
>> structured reply to all other requests.
> 
> What was the reason for it? Why not to negotiate forced structured read
> separately?

Because the whole reason we (want to) introduce structured replies IS to
fix the inability to do a partial read or an efficient read of zeroes.
The fact that structured reads make other extensions possible is icing
on the cake, but if you are going to implement structured replies at
all, you might as well make reads do it (since reads are mandatory,
while all other commands that utilize structured replies are optional).

> Actually, this spec forces any server, which wants to
> implement structured reply implement structured read too. But what if it
> don't want to? If it only wants to implement BLOCK_STATUS?

We intentionally do not want to permit such a server. Any server that
wants to implement BLOCK_STATUS must also implement structured reads.

> 
> So, what about changing it, to allow BLOCK_STATUS (or other future
> structured replies) without structured read? Structured read is good
> only for sparse formats,

Not true - it is also good for error recovery even on non-sparse
exports.  The existing read command is flawed in that it cannot be
implemented with partial read support - once the server has started
sending data, it MUST finish sending the number of bytes requested by
the client, which means the server MUST either buffer the read up front
(to ensure no read error is possible once the data sending is started),
or MUST disconnect if a read error is detected partway through.  With
structured reads, you can implement much more efficient servers that
start sending the reply right away without buffering, but which can
still error out on a read error partway through.

> when BLOCK_STATUS is more global. I understand,
> that servers may implement simple (and useless) one-chunk structured
> read, but I think that it is better to fix the spec, to not provoke
> servers use such workaround.

To date, we don't know of ANY servers that implement structured replies
at all, whether for structured reads or for BLOCK_STATUS.  I'm working
on qemu patches to make qemu implement both, and it will serve as an
example of how easy or hard it is to implement things.  But I see NO
reason to weaken the spec to allow structured BLOCK_STATUS without
structured reads.
Wouter Verhelst Dec. 28, 2016, 12:18 a.m. UTC | #33
Hi Vladimir,

On Mon, Dec 26, 2016 at 05:52:54PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Shouldn't we add some flags to REP_META_CONTEXT, for client to be insure, is
> returned string a direct context name or some kind of wildcard? Just a flags
> field, with one flag defined for now: NBD_REP_META_CONTEXT_LEAF and others
> reserved.

I think it should be up to the metadata context namespace definition to
define which syntax represents a direct context name and which
represents a wildcard (if the latter are supported).

A client which doesn't know what a given metadata context implements
can't reasonably ask for information from that context anyway (since
then the client wouldn't know what to do with the returned information),
so it doesn't help much to add a flag here.
Alex Bligh Dec. 29, 2016, 4:04 p.m. UTC | #34
> On 28 Dec 2016, at 00:18, Wouter Verhelst <w@uter.be> wrote:
> 
> On Mon, Dec 26, 2016 at 05:52:54PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> Shouldn't we add some flags to REP_META_CONTEXT, for client to be insure, is
>> returned string a direct context name or some kind of wildcard? Just a flags
>> field, with one flag defined for now: NBD_REP_META_CONTEXT_LEAF and others
>> reserved.
> 
> I think it should be up to the metadata context namespace definition to
> define which syntax represents a direct context name and which
> represents a wildcard (if the latter are supported).
> 
> A client which doesn't know what a given metadata context implements
> can't reasonably ask for information from that context anyway (since
> then the client wouldn't know what to do with the returned information),
> so it doesn't help much to add a flag here.

I agree.

Vladimir: if this isn't clear from the text, please suggest a change.
Alex Bligh Dec. 29, 2016, 4:08 p.m. UTC | #35
Vladimir,

> On 26 Dec 2016, at 15:57, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
> 
>> OK, so first of all, one of the changes I made earlier was that now
>> each of the commands carries a list of queries, the way you list
>> everything is not 'having a query that doesn't contain a namespace'
>> but rather doing a _LIST_ with no queries at all. But that's semantics
>> and orthogonal to the main point.
>> 
>> What I've proposed (and pushed - but feel free to alter it) is that
> 
> I think, an overview in "## Metadata querying" should be updated too.

Sorry, which bit needs updating?

The only bit in that more general section that mentions _LIST_ is:

> First, during negotiation, if the client wishes to query metadata
> during transmission, the client MUST select one or more metadata
> contexts with the NBD_OPT_SET_META_CONTEXT command. If needed, the
> client can use NBD_OPT_LIST_META_CONTEXT to list contexts that the
> server supports.

That still seems correct.
Alex Bligh Dec. 29, 2016, 4:14 p.m. UTC | #36
> On 27 Dec 2016, at 14:09, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
> 
> What was the reason for it? Why not to negotiate forced structured read separately? Actually, this spec forces any server, which wants to implement structured reply implement structured read too. But what if it don't want to? If it only wants to implement BLOCK_STATUS?
> 
> So, what about changing it, to allow BLOCK_STATUS (or other future structured replies) without structured read? Structured read is good only for sparse formats, when BLOCK_STATUS is more global. I understand, that servers may implement simple (and useless) one-chunk structured read, but I think that it is better to fix the spec, to not provoke servers use such workaround.

In essence the current reply format is broken, because it does not provide a means of delivering an error mid reply AND expects fixed length replies. Block Status is the poster child for something that benefits from structured replies.

So, BLOCK_STATUS requires structured replies. We thus have the choice between:
* Allowing structured replies to be implemented on a per-command basis
* Mandating that if structured replies are implemented, they are implemented universally

The second option is far simpler, and is what the current structured reply extension sets out. It's also a good nudge to server implementers (me included) to implement structured replies.

Your assumption that structured replies are only good for sparse formats is incorrect. It's good for anything that wants to include sensible error handling. As a server author, error handling at the moment is a pain. Most of us ALREADY break up large read requests (to avoid malloc() of huge amounts of memory), so we already have a read loop. If we don't then the 'one chunk' method is perfectly acceptable. If I implemented structured replies at all, I'd be sharing the code between multiple users in any case.
Vladimir Sementsov-Ogievskiy Dec. 29, 2016, 4:28 p.m. UTC | #37
29.12.2016 19:04, Alex Bligh wrote:
>> On 28 Dec 2016, at 00:18, Wouter Verhelst <w@uter.be> wrote:
>>
>> On Mon, Dec 26, 2016 at 05:52:54PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>>> Shouldn't we add some flags to REP_META_CONTEXT, for client to be insure, is
>>> returned string a direct context name or some kind of wildcard? Just a flags
>>> field, with one flag defined for now: NBD_REP_META_CONTEXT_LEAF and others
>>> reserved.
>> I think it should be up to the metadata context namespace definition to
>> define which syntax represents a direct context name and which
>> represents a wildcard (if the latter are supported).
>>
>> A client which doesn't know what a given metadata context implements
>> can't reasonably ask for information from that context anyway (since
>> then the client wouldn't know what to do with the returned information),
>> so it doesn't help much to add a flag here.
> I agree.
>
> Vladimir: if this isn't clear from the text, please suggest a change.
>

I'm ok with Wouter's arguments
Vladimir Sementsov-Ogievskiy Dec. 29, 2016, 4:35 p.m. UTC | #38
29.12.2016 19:08, Alex Bligh wrote:
> Vladimir,
>
>> On 26 Dec 2016, at 15:57, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
>>
>>> OK, so first of all, one of the changes I made earlier was that now
>>> each of the commands carries a list of queries, the way you list
>>> everything is not 'having a query that doesn't contain a namespace'
>>> but rather doing a _LIST_ with no queries at all. But that's semantics
>>> and orthogonal to the main point.
>>>
>>> What I've proposed (and pushed - but feel free to alter it) is that
>> I think, an overview in "## Metadata querying" should be updated too.
> Sorry, which bit needs updating?
>
> The only bit in that more general section that mentions _LIST_ is:
>
>> First, during negotiation, if the client wishes to query metadata
>> during transmission, the client MUST select one or more metadata
>> contexts with the NBD_OPT_SET_META_CONTEXT command. If needed, the
>> client can use NBD_OPT_LIST_META_CONTEXT to list contexts that the
>> server supports.
> That still seems correct.
>

Hmm, I am not sure. Seems like it was out of topic.. Here the only thing 
may be s/list contexts that the server supports/available list 
contexts/, to be more consistent, but it is not important.
Vladimir Sementsov-Ogievskiy Jan. 11, 2017, 3:31 p.m. UTC | #39
from current version:

 >>> If an error occurs, the server SHOULD set the appropriate error 
code in the error field of an error chunk. However, if the error does 
not involve invalid usage (such as a request beyond the bounds of the 
file), a server MAY reply with a single block status descriptor with 
/length/ matching the requested length, and /status/ of 0 rather than 
reporting the error.

- single block status descriptor for each context? Isn't it 
implementation defined? Or we finally decided to force 0 status to be 
safe default for all contexts? If it is so, it would be better to 
describe this separately. However, personally, I'd prefer to not define 
contexts internal semantics at all.
Alex Bligh Jan. 11, 2017, 7 p.m. UTC | #40
> On 11 Jan 2017, at 15:31, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
> 
> >>> If an error occurs, the server SHOULD set the appropriate error code in the error field of an error chunk. However, if the error does not involve invalid usage (such as a request beyond the bounds of the file), a server MAY reply with a single block status descriptor with length matching the requested length, and status of 0 rather than reporting the error.
> - single block status descriptor for each context? Isn't it implementation defined? Or we finally decided to force 0 status to be safe default for all contexts? If it is so, it would be better to describe this separately. However, personally, I'd prefer to not define contexts internal semantics at all.

I think this is Wouter's wording, but I think 'a status appropriate to the context' would be better. Each context then needs to define what that is. Either that or 'the context's default status' and that should be in the definition of the context.
Vladimir Sementsov-Ogievskiy Jan. 12, 2017, 7:05 a.m. UTC | #41
11.01.2017 22:00, Alex Bligh wrote:
>> On 11 Jan 2017, at 15:31, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
>>
>>>>> If an error occurs, the server SHOULD set the appropriate error code in the error field of an error chunk. However, if the error does not involve invalid usage (such as a request beyond the bounds of the file), a server MAY reply with a single block status descriptor with length matching the requested length, and status of 0 rather than reporting the error.
>> - single block status descriptor for each context? Isn't it implementation defined? Or we finally decided to force 0 status to be safe default for all contexts? If it is so, it would be better to describe this separately. However, personally, I'd prefer to not define contexts internal semantics at all.
> I think this is Wouter's wording, but I think 'a status appropriate to the context' would be better. Each context then needs to define what that is. Either that or 'the context's default status' and that should be in the definition of the context.
>

Yes this is better. But is it actually needed to force contexts have 
some safe default? If context wants it may define such default without 
this requirement.. So, should it be requirement at all?
Vladimir Sementsov-Ogievskiy Jan. 12, 2017, 11:43 a.m. UTC | #42
(Sorry, it may be a bit out of topic, but I hope it is comfortable for 
all that I just comment current version of the feature by answering 
cover letter of last related patch set)

 From |"NBD_OPT_LIST_META_CONTEXT| (9)":
> The server MUST either reply with an error (for instance |EINVAL| if 
> the option is not supported), or reply with a list of 
> |NBD_REP_META_CONTEXT| replies followed by |NBD_REP_ACK|.:

NBD_REP_ERR_UNSUP for the case in braces? Should it be normal option 
reply packet?
Alex Bligh Jan. 12, 2017, 1:11 p.m. UTC | #43
> On 12 Jan 2017, at 07:05, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
> 
> Yes this is better. But is it actually needed to force contexts have some safe default? If context wants it may define such default without this requirement.. So, should it be requirement at all?

I've changed this to:

    of the file), a server MAY reply with a single block status
    descriptor with *length* matching the requested length, rather than
    reporting the error; in this case the context MAY mandate the
    status returned.
Alex Bligh Jan. 12, 2017, 1:16 p.m. UTC | #44
> On 12 Jan 2017, at 11:43, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
> 
> From "NBD_OPT_LIST_META_CONTEXT (9)":
>> The server MUST either reply with an error (for instance EINVAL if the option is not supported), or reply with a list of NBD_REP_META_CONTEXT replies followed by NBD_REP_ACK.:
> NBD_REP_ERR_UNSUP for the case in braces? Should it be normal option reply packet?

Thanks, fixed.
Vladimir Sementsov-Ogievskiy Jan. 13, 2017, 9:48 a.m. UTC | #45
12.01.2017 16:11, Alex Bligh wrote:
>> On 12 Jan 2017, at 07:05, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
>>
>> Yes this is better. But is it actually needed to force contexts have some safe default? If context wants it may define such default without this requirement.. So, should it be requirement at all?
> I've changed this to:
>
>      of the file), a server MAY reply with a single block status
>      descriptor with *length* matching the requested length, rather than
>      reporting the error; in this case the context MAY mandate the
>      status returned.
>
>

Hmm, I don't understand. So, it MAY mandate and therefore MAY NOT do it? 
And what client should think, if server replies with one chunk matching 
the request length and not mandate the status?
Alex Bligh Jan. 13, 2017, 10:29 a.m. UTC | #46
> On 13 Jan 2017, at 09:48, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
> 
> 12.01.2017 16:11, Alex Bligh wrote:
>>> On 12 Jan 2017, at 07:05, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
>>> 
>>> Yes this is better. But is it actually needed to force contexts have some safe default? If context wants it may define such default without this requirement.. So, should it be requirement at all?
>> I've changed this to:
>> 
>>     of the file), a server MAY reply with a single block status
>>     descriptor with *length* matching the requested length, rather than
>>     reporting the error; in this case the context MAY mandate the
>>     status returned.
>> 
>> 
> 
> Hmm, I don't understand. So, it MAY mandate and therefore MAY NOT do it? And what client should think, if server replies with one chunk matching the request length and not mandate the status?

Some contexts may mandate a particular value (so for instance the allocation context might mandate 0).

Some contexts may not mandate a particular value, in which case the interpretation is dependent upon the context (just like any other status value). EG a context which returned an status of 7 if the range contained a prime number, and else 3, could carry on doing that.

As it doesn't make sense to interpret status returns without an understanding of the particular context, we might as well simply extend this to 'beyond the range' returns - as I think you pointed out!
Vladimir Sementsov-Ogievskiy Jan. 16, 2017, 12:26 p.m. UTC | #47
13.01.2017 13:29, Alex Bligh wrote:
>> On 13 Jan 2017, at 09:48, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
>>
>> 12.01.2017 16:11, Alex Bligh wrote:
>>>> On 12 Jan 2017, at 07:05, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
>>>>
>>>> Yes this is better. But is it actually needed to force contexts have some safe default? If context wants it may define such default without this requirement.. So, should it be requirement at all?
>>> I've changed this to:
>>>
>>>      of the file), a server MAY reply with a single block status
>>>      descriptor with *length* matching the requested length, rather than
>>>      reporting the error; in this case the context MAY mandate the
>>>      status returned.
>>>
>>>
>> Hmm, I don't understand. So, it MAY mandate and therefore MAY NOT do it? And what client should think, if server replies with one chunk matching the request length and not mandate the status?
> Some contexts may mandate a particular value (so for instance the allocation context might mandate 0).
>
> Some contexts may not mandate a particular value, in which case the interpretation is dependent upon the context (just like any other status value). EG a context which returned an status of 7 if the range contained a prime number, and else 3, could carry on doing that.
>
> As it doesn't make sense to interpret status returns without an understanding of the particular context, we might as well simply extend this to 'beyond the range' returns - as I think you pointed out!
>

 >>> The status flags are intentionally defined so that a server MAY 
always safely report a status of 0 for any block

- Actually, status flags are _not_ defined. (each context defines it's 
own status flags)
Vladimir Sementsov-Ogievskiy Jan. 20, 2017, 5:04 p.m. UTC | #48
About extents: is 32bit length enough? We will have to send 4096 for 
empty 16tb disk..
Alex Bligh Jan. 20, 2017, 6 p.m. UTC | #49
> On 20 Jan 2017, at 17:04, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
> 
> About extents: is 32bit length enough? We will have to send 4096 for empty 16tb disk..

The nbd protocol uniformly uses 32 bit lengths (for better or for worse). This is baked into the specification heavily.

I'm not sure sending 4,096 items for an empty 16TB disk is any great hardship to be honest.
Eric Blake Jan. 20, 2017, 7:35 p.m. UTC | #50
On 01/20/2017 12:00 PM, Alex Bligh wrote:
> 
>> On 20 Jan 2017, at 17:04, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
>>
>> About extents: is 32bit length enough? We will have to send 4096 for empty 16tb disk..
> 
> The nbd protocol uniformly uses 32 bit lengths (for better or for worse). This is baked into the specification heavily.
> 
> I'm not sure sending 4,096 items for an empty 16TB disk is any great hardship to be honest.

If it truly matters, an idea that has already been floated on the list
is to add a new NBD_CMD_FLAG_SCALE (or some other spelling) that states
that a particular command is sending lengths scaled by a particular
factor (by the block size? obviously it would have to be better
specified).  Under this idea, the scaling factor would allow you to
report larger extents for fewer back-and-forth operations, but it gets
tricky if the scaling is allowed to get larger than the minimum
granularity between extent changes.

The other idea that has been floated is a way to report whether the
entire export is known to be all-zero content at the time the client
connects, which would trade 4096+ queries (you'd actually have to do
more than 4096 queries, since length is < 4G, not <= 4G) with a single
piece of information at the time the client connects.

Either way, discussion on such enhancements are probably worth a new thread.
Wouter Verhelst Jan. 21, 2017, 12:25 p.m. UTC | #51
On Fri, Jan 20, 2017 at 01:35:10PM -0600, Eric Blake wrote:
> On 01/20/2017 12:00 PM, Alex Bligh wrote:
> > 
> >> On 20 Jan 2017, at 17:04, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
> >>
> >> About extents: is 32bit length enough? We will have to send 4096 for empty 16tb disk..
> > 
> > The nbd protocol uniformly uses 32 bit lengths (for better or for worse). This is baked into the specification heavily.
> > 
> > I'm not sure sending 4,096 items for an empty 16TB disk is any great hardship to be honest.
> 
> If it truly matters, an idea that has already been floated on the list
> is to add a new NBD_CMD_FLAG_SCALE (or some other spelling) that states
> that a particular command is sending lengths scaled by a particular
> factor (by the block size? obviously it would have to be better
> specified).  Under this idea, the scaling factor would allow you to
> report larger extents for fewer back-and-forth operations, but it gets
> tricky if the scaling is allowed to get larger than the minimum
> granularity between extent changes.

Right, but people objected to it on grounds of it being too complicated
(which I think was a fair point).

> The other idea that has been floated is a way to report whether the
> entire export is known to be all-zero content at the time the client
> connects, which would trade 4096+ queries (you'd actually have to do
> more than 4096 queries, since length is < 4G, not <= 4G) with a single
> piece of information at the time the client connects.

That's pretty special-case, and I objected to it on those grounds :-)

However, I don't think there's anything necessarily wrong in changing
the size of the length field in the reply header of the
NBD_REPLY_TYPE_BLOCK_STATUS packet. That way, combined with the fact
that we'd already stated that a server may send more information than
was requested, a client could ask for metadata on an export, and if the
extent is the whole size of the export, the server could say so in a
single reply for all export sizes.

I suppose it'd be slightly odd to have the request use 32-bit lengths
and the response be expressed in 64-bit ones, but I suppose that's the
price we pay to remain a backwards compatible and not too complicated a
protocol.

> Either way, discussion on such enhancements are probably worth a new
> thread.

Sure.
Eric Blake Feb. 16, 2018, 12:35 p.m. UTC | #52
Reviving an old thread, to bring up questions based on a new push to 
implement this extension in qemu.

On 12/14/2016 09:08 AM, Alex Bligh wrote:
> 
> * Change NBD_OPT_LIST_METADATA etc. to explicitly send a list of queries
>    and add a count of queries so we can extend the command later (rather than
>    rely on the length of option)
> 
> * For NBD_OPT_LIST_METADATA make absence of any query (as opposed to zero
>    length query) list all contexts, as absence of any query is now simple.
> 
> * Move definition of namespaces in the document to somewhere more appopriate.
> 
> * Various other minor changes as discussed on the mailing list
> 
> Signed-off-by: Alex Bligh <alex@alex.org.uk>
> ---
>   doc/proto.md | 179 +++++++++++++++++++++++++++++++++++++----------------------
>   1 file changed, 112 insertions(+), 67 deletions(-)
> 

>   
> +Metadata contexts are identified by their names. The name MUST
> +consist of a namespace, followed by a colon, followed by a leaf-name.
> +The namespace and the leaf-name must each consist entirely of
> +printable non-whitespace UTF-8 characters other than colons,
> +and be non-empty. The entire name (namespace, colon and leaf-name)
> +MUST NOT exceed 255 bytes (and therefore botht he namespace and
> +leaf-name are guaranteed to be smaller than 255 bytes).
> +
> +Namespaces MUST be consist of one of the following:
> +- `base`, for metadata contexts defined by this document;
> +- `nbd-server`, for metadata contexts defined by the
> +   implementation that accompanies this document (none
> +   currently);
> +- `x-*`, where `*` can be replaced by any random string not

We're inconsistent on whether extensions are 'x-' or 'X-'; we should 
tighten that up.  (No real impact to implementing things - a server 
should just ignore namespaces it doesn't recognize, regardless of how 
that unknown namespace was spelled).

> +   containing colons, for local experiments. This SHOULD NOT be
> +   used by metadata contexts that are expected to be widely used.
> +- A third-party namespace from the list below.

> @@ -932,51 +961,58 @@ of the newstyle negotiation.
> +    If zero queries are sent, then the server MUST return all
> +    the metadata contexts it knows about.
> +
> +    For details on the query string, see under `NBD_OPT_SET_META_CONTEXT`.
> +
> +    The server MUST either reply with an error (for instance `EINVAL`
> +    if the option is not supported), or reply with a list of
> +    `NBD_REP_META_CONTEXT` replies followed by `NBD_REP_ACK`.
> +    The metadata context ID in these replies is reserved and SHOULD be
> +    set to zero; clients MUST disregard it.

The question came up whether the server is required/permitted to 
diagnose bogus queries (a query for "bad" has no colon, and therefore 
cannot represent any namespace - is that required to be an error, or can 
the server silently ignore it and process the rest of the requests?  Can 
the client rely on the server diagnosing bad requests, or must it be 
prepared for the server to just ignore bad requests?).

Here's my thinking:

A server implementation may want to vet only the first 5 characters of a 
request (because it only supports the namespace "base:").  If that 
server encounters a client asks for context "X-longname:", that is a 
valid request (so we should NOT reply with an error, but merely ignore 
it as an unknown namespace); but if a client asks for namespace 
"garbage", we have the option of whether to return an error.  But for 
both of those client requests, there was no colon in the first 5 
characters.  For a server to robustly distinguish between the two, we 
have to read the entire request and search for a colon, to decide 
whether the missing colon warrants an error reply.  But a client can 
request a name up to 4k in length (the NBD maximum string) - so taking 
the argument to an extreme, we have to manage a 4k string before making 
our decision.  Reading 5 bytes fits easily into the stack, but reading 
4k bytes onto the stack risks skipping the guard page on an OS with 4k 
page sizes, for less-than-stellar handling on stack overflow; and 
reading one byte at a time to check for colon is a pain compared to just 
deciding after 5 bytes that the rest of the string is irrelevant and 
skipping ahead in the data stream to the next point where any useful 
work might be performed.  Thus, I'm arguing that for ease of server 
implementation, we should permit, but not require, the server to 
diagnose ill-formed requests that do not begin with namespace-colon; but 
that we MUST require that the server ignores unknown but well-formed 
requests rather than treating them as errors.

But there is another alternative as well - instead of returning a 
two-part string where colon is the separator, and where the server must 
parse to locate the colon (or lack thereof), would it make sense to 
instead change the NBD_OPT_{LIST,SET}_META_CONTEXT and 
NBD_REP_META_CONTEXT field layout to have two separate length/strings 
per context (first is length/string for namespace, second is 
length/string for leaf name), where colon is no longer special (it is no 
longer possible for a client to pass an ill-formed request that lacks 
colon), and where the server can immediately tell that 'namespace_length 
!= 5, therefore this is a request for a namespace I don't care about'?


> -    These two fields MAY be repeated as much as is necessary to select all
> -    metadata contexts the client is interested in.
> +    - 32 bits, length of export name.
> +    - String, name of export for which we wish to list metadata
> +      contexts.
> +    - 32 bits, number of queries
> +    - Zero or more queries, each being:
> +       - 32 bits, length of query
> +       - String, query to select metadata contexts. The syntax of this
> +         query is implementation-defined, except that it MUST start with a
> +         namespace and a colon.
> +

Also, is 32 bits as the length of the query really necessary?  Given 
that NBD strings are capped at 4k, a 16-bit length is sufficient.  It's 
not like we have padding to get things to natural alignments (if it were 
a matter of always sending 32-bit or 64-bit quantities on natural 
alignments, we'd have padding in a lot more places).
Vladimir Sementsov-Ogievskiy Feb. 16, 2018, 1:53 p.m. UTC | #53
16.02.2018 15:35, Eric Blake wrote:
> Reviving an old thread, to bring up questions based on a new push to 
> implement this extension in qemu.
>
> On 12/14/2016 09:08 AM, Alex Bligh wrote:
>>
>> * Change NBD_OPT_LIST_METADATA etc. to explicitly send a list of queries
>>    and add a count of queries so we can extend the command later 
>> (rather than
>>    rely on the length of option)
>>
>> * For NBD_OPT_LIST_METADATA make absence of any query (as opposed to 
>> zero
>>    length query) list all contexts, as absence of any query is now 
>> simple.
>>
>> * Move definition of namespaces in the document to somewhere more 
>> appopriate.
>>
>> * Various other minor changes as discussed on the mailing list
>>
>> Signed-off-by: Alex Bligh <alex@alex.org.uk>
>> ---
>>   doc/proto.md | 179 
>> +++++++++++++++++++++++++++++++++++++----------------------
>>   1 file changed, 112 insertions(+), 67 deletions(-)
>>
>
>>   +Metadata contexts are identified by their names. The name MUST
>> +consist of a namespace, followed by a colon, followed by a leaf-name.
>> +The namespace and the leaf-name must each consist entirely of
>> +printable non-whitespace UTF-8 characters other than colons,
>> +and be non-empty. The entire name (namespace, colon and leaf-name)
>> +MUST NOT exceed 255 bytes (and therefore botht he namespace and
>> +leaf-name are guaranteed to be smaller than 255 bytes).
>> +
>> +Namespaces MUST be consist of one of the following:
>> +- `base`, for metadata contexts defined by this document;
>> +- `nbd-server`, for metadata contexts defined by the
>> +   implementation that accompanies this document (none
>> +   currently);
>> +- `x-*`, where `*` can be replaced by any random string not
>
> We're inconsistent on whether extensions are 'x-' or 'X-'; we should 
> tighten that up.  (No real impact to implementing things - a server 
> should just ignore namespaces it doesn't recognize, regardless of how 
> that unknown namespace was spelled).

personally I'm for x- like in qmp, it's just nicer . However I'm sure, 
regardless of the choice, people will avoid both variant fora production 
names)


>
>> +   containing colons, for local experiments. This SHOULD NOT be
>> +   used by metadata contexts that are expected to be widely used.
>> +- A third-party namespace from the list below.
>
>> @@ -932,51 +961,58 @@ of the newstyle negotiation.
>> +    If zero queries are sent, then the server MUST return all
>> +    the metadata contexts it knows about.
>> +
>> +    For details on the query string, see under 
>> `NBD_OPT_SET_META_CONTEXT`.
>> +
>> +    The server MUST either reply with an error (for instance `EINVAL`
>> +    if the option is not supported), or reply with a list of
>> +    `NBD_REP_META_CONTEXT` replies followed by `NBD_REP_ACK`.
>> +    The metadata context ID in these replies is reserved and SHOULD be
>> +    set to zero; clients MUST disregard it.
>
> The question came up whether the server is required/permitted to 
> diagnose bogus queries (a query for "bad" has no colon, and therefore 
> cannot represent any namespace - is that required to be an error, or 
> can the server silently ignore it and process the rest of the 
> requests?  Can the client rely on the server diagnosing bad requests, 
> or must it be prepared for the server to just ignore bad requests?).

What do you mean by 'ignore'? Replying with empty list (if only one 
invalid query)? Hmm. I'm ok with either way.


>
> Here's my thinking:
>
> A server implementation may want to vet only the first 5 characters of 
> a request (because it only supports the namespace "base:").  If that 
> server encounters a client asks for context "X-longname:", that is a 
> valid request (so we should NOT reply with an error, but merely ignore 
> it as an unknown namespace); but if a client asks for namespace 
> "garbage", we have the option of whether to return an error.  But for 
> both of those client requests, there was no colon in the first 5 
> characters.  For a server to robustly distinguish between the two, we 
> have to read the entire request and search for a colon, to decide 
> whether the missing colon warrants an error reply.  But a client can 
> request a name up to 4k in length (the NBD maximum string) - so taking 
> the argument to an extreme, we have to manage a 4k string before 
> making our decision.  Reading 5 bytes fits easily into the stack, but 
> reading 4k bytes onto the stack risks skipping the guard page on an OS 
> with 4k page sizes, for less-than-stellar handling on stack overflow; 
> and reading one byte at a time to check for colon is a pain compared 
> to just deciding after 5 bytes that the rest of the string is 
> irrelevant and skipping ahead in the data stream to the next point 
> where any useful work might be performed.  Thus, I'm arguing that for 
> ease of server implementation, we should permit, but not require, the 
> server to diagnose ill-formed requests that do not begin with 
> namespace-colon; but that we MUST require that the server ignores 
> unknown but well-formed requests rather than treating them as errors.

Completely agree.


>
> But there is another alternative as well - instead of returning a 
> two-part string where colon is the separator, and where the server 
> must parse to locate the colon (or lack thereof), would it make sense 
> to instead change the NBD_OPT_{LIST,SET}_META_CONTEXT and 
> NBD_REP_META_CONTEXT field layout to have two separate length/strings 
> per context (first is length/string for namespace, second is 
> length/string for leaf name), where colon is no longer special (it is 
> no longer possible for a client to pass an ill-formed request that 
> lacks colon), and where the server can immediately tell that 
> 'namespace_length != 5, therefore this is a request for a namespace I 
> don't care about'?

Good idea. But it would be tricky thing to maintain backward 
compatibility with published versions of virtuozzo product. And finally 
our implementation would be more complex because of this simplification.


>
>
>> -    These two fields MAY be repeated as much as is necessary to 
>> select all
>> -    metadata contexts the client is interested in.
>> +    - 32 bits, length of export name.
>> +    - String, name of export for which we wish to list metadata
>> +      contexts.
>> +    - 32 bits, number of queries
>> +    - Zero or more queries, each being:
>> +       - 32 bits, length of query
>> +       - String, query to select metadata contexts. The syntax of this
>> +         query is implementation-defined, except that it MUST start 
>> with a
>> +         namespace and a colon.
>> +
>
> Also, is 32 bits as the length of the query really necessary? Given 
> that NBD strings are capped at 4k, a 16-bit length is sufficient.  
> It's not like we have padding to get things to natural alignments (if 
> it were a matter of always sending 32-bit or 64-bit quantities on 
> natural alignments, we'd have padding in a lot more places).
>

Hm. Finally, you suggested several changes (including already merged 
56c77720 :( ). Suggestions are logical. But if they will be accepted, we 
(Virtuozzo) will have to invent tricky hard-to-maintain code, to 
distinguish by third factors our already published versions. So, I 
suggest to add some negotiation flag (BLOCK_STATUS_FIXED ?), to 
negotiate your changes (including 56c77720). In upstream we can skip 
supporting BLOCK_STATUS without that special flag, but it will help us 
to distinguish our previous versions reliably.
Eric Blake Feb. 16, 2018, 4:10 p.m. UTC | #54
On 02/16/2018 07:53 AM, Vladimir Sementsov-Ogievskiy wrote:

> 
> Good idea. But it would be tricky thing to maintain backward 
> compatibility with published versions of virtuozzo product. And finally 
> our implementation would be more complex because of this simplification.
> 

> 
> Hm. Finally, you suggested several changes (including already merged 
> 56c77720 :( ). Suggestions are logical. But if they will be accepted, we 
> (Virtuozzo) will have to invent tricky hard-to-maintain code, to 
> distinguish by third factors our already published versions. So, I 
> suggest to add some negotiation flag (BLOCK_STATUS_FIXED ?), to 
> negotiate your changes (including 56c77720). In upstream we can skip 
> supporting BLOCK_STATUS without that special flag, but it will help us 
> to distinguish our previous versions reliably.

So if I understand correctly, you already have an implementation of NBD 
servers in the wild that support a preliminary version of 
NBD_OPT_SET_META_CONTEXT, but which do not necessarily obey the 
semantics in the current extension branch (and which may not be the 
final semantics by the time we promote the extension branch into the NBD 
spec overall).

Also, the Virtuozzo product hasn't been mentioned on the NBD list, and 
while you are preparing patches to the public qemu based on the 
Virtuozzo product, I'm not sure if there is a public repository for the 
existing Virtuozzo product out in the wild.  Is your product using 
NBD_OPT_LIST_META_CONTEXT, or _just_ NBD_OPT_SET_META_CONTEXT?

Now, let's assess the ramifications for changing the current NBD 
proposal, before finalizing it into mainline.  Once it is in mainline, 
all other implementations have to obey the spec; but we want to leave 
ourselves the freedom to tweak the extension prior to that point 
according to things learned while implementing it.  With that said, what 
combinations are you hoping to still support?  And how long do you 
expect to have mismatched versions in the field, or is it something 
where it is relatively easy to upgrade both client and server within a 
short timeframe?

Here's what things look like if we do nothing, and the NBD_OPT side of 
the handshake stays compatible (as I understand it, your complaint about 
commit 56c77720 is that it affects only the transmission side; leaving 
aside for the moment that I have opened the door for even more changes 
that may affect the handshake side).

old client, old server (what is in the field now): The Virtuozzo client 
is able to request status from the Virtuozzo server, no other NBD client 
or server cares because they pre-date block status

new client, old server: Any new client that implements block_status, as 
currently documented, but talks to an existing Virtuozzo server, will 
get garbage replies to NBD_CMD_BLOCK_STATUS (status 5 instead of 
expected status 3).  Virtuozzo's client could be taught to special-case 
things and accept BOTH values for status replies (as long as nothing 
else changes between the old implementation and when NBD finally accepts 
the extension into mainline)

old client, new server: The Virtuozzo client talking to a non-Virtuozzo 
server that understands block_status as currently documented will get 
garbage replies to NBD_CMD_BLOCK_STATUS (status 3 instead of expected 
status 5). The Virtuozzo server cannot be patched to work around this, 
because it has no way to detect a Virtuozzo client differently from any 
other client.

new client, new server: Because everyone follows the same documentation, 
it should be interoperable.


It's ALWAYS risky to put an experimental item into production use; some 
precautions such as using intentionally higher numbers than normal may 
make it easier to identify the experimental versions (that is, if the 
proposal is to implement item '10', the experimental code implements 
item '10010'; then, any changes to what actually gets finalized into the 
semantics for item '10' will not affect continued use of the '10010' 
semantics).  But since you didn't do that, my immediate reaction is that 
adding a 'BLOCK_STATUS_FIXED' to the NBD protocol sounds crazy (fixed 
compared to what? An unpublished early implementation?).  But what we 
CAN do is specify that NBD_OPT_ value 10 is reserved for a (withdrawn) 
experimental extension (similar to how NBD_OPT_ value 4 for PEEK_EXPORT 
is skipped), and make NBD_OPT_SET_META_CONTEXT be 11.  Then you have:

old client, old server: client sends NBD_OPT 10, Virtuozzo server 
replies, and both sides use whatever you implemented even if it differs 
from the final NBD spec

new client, old server: any compliant client sends NBD_OPT 11, Virtuozzo 
server doesn't recognize it, and you lose the ability to query block 
status. Additionally, you can teach the Virtuozzo client to try 11 
first, and if it fails, fall back to trying 10 - the client then has to 
understand both old and new flavors, but can now talk to any version of 
the Virtuozzo server.

old client, new server: the Virtuozzo client sends NBD_OPT 10, a 
compliant server doesn't recognize it, and you don't get to query block 
status, but you are no worse off than for any other server that doesn't 
understand block status.  Additionally, you can teach the Virtuozzo 
server to accept opt 10 in addition to 11; the server now has to 
understand both old and new flavors, but can now talk to any version of 
the Virtuozzo client.

new client, new server: the client sends NBD_OPT 11, the server replies, 
and they communicate according to the spec

Or, we can revert the change in commit 56c77720, and keep 
NBD_REPLY_TYPE_BLOCK_STATUS at 5 (it leaves a hole in the NBD_REPLY_TYPE 
numbering, where 3 and 4 might be filled in by other future extensions, 
or permanently skipped).  This works IF there are no OTHER incompatible 
changes made to the rest of the block status extension as part of 
promoting it to current (where we still haven't finished that debate, 
given my question on whether 32-bit lengths and colon-separated 
namespace:leaf in a single string is the best representation).

So, I'd like some feedback from Alex or Wouter on which alternatives 
seem nicest at this point.
Wouter Verhelst Feb. 28, 2018, 1:08 p.m. UTC | #55
Hi,

Sorry, I forgot to reply to this earlier.

On Fri, Feb 16, 2018 at 10:10:59AM -0600, Eric Blake wrote:
> On 02/16/2018 07:53 AM, Vladimir Sementsov-Ogievskiy wrote:
> > Good idea. But it would be tricky thing to maintain backward
> > compatibility with published versions of virtuozzo product. And finally
> > our implementation would be more complex because of this simplification.
> 
> > Hm. Finally, you suggested several changes (including already merged
> > 56c77720 :( ). Suggestions are logical. But if they will be accepted, we
> > (Virtuozzo) will have to invent tricky hard-to-maintain code, to
> > distinguish by third factors our already published versions.

Hrm. I wasn't aware you'd already published those versions; if you had
told us, we could've reviewed the spec at that point and either merge it
or update it to incorporate whatever changes you think seem like they
are necessary.

Note that the section "Experimental extensions" contains the following
wording:

    In addition to the normative elements of the specification set out
    herein, various experimental non-normative extensions have been
    proposed. These may not be implemented in any known server or
    client, and are subject to change at any point. A full
    implementation may require changes to the specifications, or cause
    the specifications to be withdrawn altogether.

    [...]

    Implementors of these extensions are strongly suggested to contact
    the mailinglist in order to help fine-tune the specifications before
    committing to a particular implementation.

i.e., what's in an "extension-" branch can be described as "we're
thinking about it and it will probably happen as written, but we're not
entirely sure yet". The idea behind this is that we don't want to limit
ourselves to things that haven't been implemented (but that the fact of
"implementing something" causes it to get written into stone, sortof).

This is different from how most standards bodies work, I'm sure, but it
seemed to work so far, and I wish you had let us know that the work you
were doing was about to be reaching customers. Anyway.

[...]
> Also, the Virtuozzo product hasn't been mentioned on the NBD list, and while
> you are preparing patches to the public qemu based on the Virtuozzo product,
> I'm not sure if there is a public repository for the existing Virtuozzo
> product out in the wild.  Is your product using NBD_OPT_LIST_META_CONTEXT,
> or _just_ NBD_OPT_SET_META_CONTEXT?

+1. What's published currently?

[...]
> Or, we can revert the change in commit 56c77720, and keep
> NBD_REPLY_TYPE_BLOCK_STATUS at 5 (it leaves a hole in the NBD_REPLY_TYPE
> numbering, where 3 and 4 might be filled in by other future extensions, or
> permanently skipped).  This works IF there are no OTHER incompatible changes
> made to the rest of the block status extension as part of promoting it to
> current (where we still haven't finished that debate, given my question on
> whether 32-bit lengths and colon-separated namespace:leaf in a single string
> is the best representation).
> 
> So, I'd like some feedback from Alex or Wouter on which alternatives seem
> nicest at this point.

I'm thinking that reverting at least the number change seems like a good
idea. If Vladimir can shed some light on what's been published so far,
we can see what damage has been done and try to limit further damage.

It makes no sense to ignore that the spec has been implemented; the
whole point of writing a spec is so that third parties can implement it
and be compatible. If we ignore that just because there was a
misunderstanding, then I think we're throwing away the kid with the
bathwater.

Since I don't think a gap in numbers is that much of a problem, I'm
happy reverting the renumbering part of 56c77720 and keep it at 5. I'd
also like to know what it is exactly that Virtuozzo implemented, so we
can update the extension-blockstatus (if necessary) and then merge that
to master.

However, for future reference, Vladimir, I would prefer it if you could
give us a heads-up if you're getting close to releasing something to
the public that's still in an experimental branch, so that we can make
updates if necessary and merge it to master.

Thanks,
Eric Blake Feb. 28, 2018, 8:26 p.m. UTC | #56
On 02/28/2018 07:08 AM, Wouter Verhelst wrote:

>> Or, we can revert the change in commit 56c77720, and keep
>> NBD_REPLY_TYPE_BLOCK_STATUS at 5 (it leaves a hole in the NBD_REPLY_TYPE
>> numbering, where 3 and 4 might be filled in by other future extensions, or
>> permanently skipped).  This works IF there are no OTHER incompatible changes
>> made to the rest of the block status extension as part of promoting it to
>> current (where we still haven't finished that debate, given my question on
>> whether 32-bit lengths and colon-separated namespace:leaf in a single string
>> is the best representation).
>>
>> So, I'd like some feedback from Alex or Wouter on which alternatives seem
>> nicest at this point.
> 
> I'm thinking that reverting at least the number change seems like a good
> idea. If Vladimir can shed some light on what's been published so far,
> we can see what damage has been done and try to limit further damage.
> 
> It makes no sense to ignore that the spec has been implemented; the
> whole point of writing a spec is so that third parties can implement it
> and be compatible. If we ignore that just because there was a
> misunderstanding, then I think we're throwing away the kid with the
> bathwater.
> 
> Since I don't think a gap in numbers is that much of a problem, I'm
> happy reverting the renumbering part of 56c77720 and keep it at 5. I'd
> also like to know what it is exactly that Virtuozzo implemented, so we
> can update the extension-blockstatus (if necessary) and then merge that
> to master.

Okay, I've gone ahead and reverted the renumbering; NBD_INFO_BLOCK_SIZE 
is back to 5 (we have a gap in the numbering, but it doesn't hurt). 
There's still the open question of whether we want to keep a single 
string that requires parsing into "namespace:leaf" sections, or to split 
things into two separate strings for ease of handing an unknown 
namespace; and whether a 16-bit length is sufficient instead of burning 
32-bit length fields.  But if we decide that any of those changes are 
worth making, we'd also want to do something else to make it obviously 
different from Virtuozzo's initial implementation (my idea about 
reserving NBD_OPT_ value 10 as unused, and renumbering 
NBD_OPT_SET_META_CONTEXT to 11 along with any other changes we make, is 
still reasonable for that point).

I'd also like to promote extension-structured-reply to current (now that 
it is implemented by qemu), while keeping extension-blockstatus as a 
separate branch for just a bit longer.

I'll post another thread on other possible doc changes that have come up 
during this thread.
Eric Blake Feb. 28, 2018, 8:29 p.m. UTC | #57
On 02/28/2018 02:26 PM, Eric Blake wrote:

> 
> Okay, I've gone ahead and reverted the renumbering; NBD_INFO_BLOCK_SIZE 
> is back to 5

It helps if I don't copy-and-paste the wrong thing.  NBD_INFO_BLOCK_SIZE 
remains 3 (as it has always been), NBD_REPLY_TYPE_BLOCK_STATUS is 
reverted back to 5 (undoing part of commit 56c77720).
Vladimir Sementsov-Ogievskiy March 1, 2018, 9:54 a.m. UTC | #58
Hi,

Sorry too for long delay.

28.02.2018 16:08, Wouter Verhelst wrote:
> Hi,
>
> Sorry, I forgot to reply to this earlier.
>
> On Fri, Feb 16, 2018 at 10:10:59AM -0600, Eric Blake wrote:
>> On 02/16/2018 07:53 AM, Vladimir Sementsov-Ogievskiy wrote:
>>> Good idea. But it would be tricky thing to maintain backward
>>> compatibility with published versions of virtuozzo product. And finally
>>> our implementation would be more complex because of this simplification.
>>> Hm. Finally, you suggested several changes (including already merged
>>> 56c77720 :( ). Suggestions are logical. But if they will be accepted, we
>>> (Virtuozzo) will have to invent tricky hard-to-maintain code, to
>>> distinguish by third factors our already published versions.
> Hrm. I wasn't aware you'd already published those versions; if you had
> told us, we could've reviewed the spec at that point and either merge it
> or update it to incorporate whatever changes you think seem like they
> are necessary.
>
> Note that the section "Experimental extensions" contains the following
> wording:
>
>      In addition to the normative elements of the specification set out
>      herein, various experimental non-normative extensions have been
>      proposed. These may not be implemented in any known server or
>      client, and are subject to change at any point. A full
>      implementation may require changes to the specifications, or cause
>      the specifications to be withdrawn altogether.
>
>      [...]
>
>      Implementors of these extensions are strongly suggested to contact
>      the mailinglist in order to help fine-tune the specifications before
>      committing to a particular implementation.
>
> i.e., what's in an "extension-" branch can be described as "we're
> thinking about it and it will probably happen as written, but we're not
> entirely sure yet". The idea behind this is that we don't want to limit
> ourselves to things that haven't been implemented (but that the fact of
> "implementing something" causes it to get written into stone, sortof).
>
> This is different from how most standards bodies work, I'm sure, but it
> seemed to work so far, and I wish you had let us know that the work you
> were doing was about to be reaching customers. Anyway.

Oh, it's my big mistake, I understand now. Discussion was (and continues 
to be=) too long, so we couldn't wait, but I of course should have 
announce it somehow, or use bigger experimental constants for the protocol.

>
> [...]
>> Also, the Virtuozzo product hasn't been mentioned on the NBD list, and while
>> you are preparing patches to the public qemu based on the Virtuozzo product,
>> I'm not sure if there is a public repository for the existing Virtuozzo
>> product out in the wild.  Is your product using NBD_OPT_LIST_META_CONTEXT,
>> or _just_ NBD_OPT_SET_META_CONTEXT?
> +1. What's published currently?

No, we don't have public repo. But our implementation is similar (in 
protocol) with sent patches. My first published implementation was 
written one year ago, in Feb 2017. I'm afraid I've missed later changes 
of the spec, will check. [hm, it's not very easy to do..]

_LIST_ is implemented in server, but only _SET_ is used in client.


---- checking for changes ----

Comparing 3118b21df (in extension-blockstatus, committed in Jan 2017) 
with current state of extension-blockstatus, don't see any significant 
changes for blockstatus, except command number 5->3, which we decided to 
rollback (thanks for this).


>
> [...]
>> Or, we can revert the change in commit 56c77720, and keep
>> NBD_REPLY_TYPE_BLOCK_STATUS at 5 (it leaves a hole in the NBD_REPLY_TYPE
>> numbering, where 3 and 4 might be filled in by other future extensions, or
>> permanently skipped).  This works IF there are no OTHER incompatible changes
>> made to the rest of the block status extension as part of promoting it to
>> current (where we still haven't finished that debate, given my question on
>> whether 32-bit lengths and colon-separated namespace:leaf in a single string
>> is the best representation).
>>
>> So, I'd like some feedback from Alex or Wouter on which alternatives seem
>> nicest at this point.
> I'm thinking that reverting at least the number change seems like a good
> idea. If Vladimir can shed some light on what's been published so far,
> we can see what damage has been done and try to limit further damage.
>
> It makes no sense to ignore that the spec has been implemented; the
> whole point of writing a spec is so that third parties can implement it
> and be compatible. If we ignore that just because there was a
> misunderstanding, then I think we're throwing away the kid with the
> bathwater.
>
> Since I don't think a gap in numbers is that much of a problem, I'm
> happy reverting the renumbering part of 56c77720 and keep it at 5. I'd
> also like to know what it is exactly that Virtuozzo implemented, so we
> can update the extension-blockstatus (if necessary) and then merge that
> to master.
>
> However, for future reference, Vladimir, I would prefer it if you could
> give us a heads-up if you're getting close to releasing something to
> the public that's still in an experimental branch, so that we can make
> updates if necessary and merge it to master.
>
> Thanks,
>
Eric Blake May 3, 2018, 4:18 p.m. UTC | #59
Reviving this discussion, as it still seems useful to incorporate now 
that BLOCK_STATUS is only recently part of the standard.

On 12/14/2016 11:09 AM, Wouter Verhelst wrote:

> One thing I've been thinking about that we might want to add:
> 
> There may be cases where a server, in performing the required calls to
> be able to handle a BLOCK_STATUS request, will end up with more
> information than the client asked; e.g., if the client asked information
> in the base:allocation context on an extent at offset X of length Y,
> then the server might conceivably do an lseek(SEEK_DATA) and/or
> lseek(SEEK_HOLE). The result of that call might be that the file offset
> will now point to a location Z, where Z > (X+Y).
> 
> Currently, our spec says "the sum of the *length* fields MUST not be
> greater than the overall *length* of request". This means that in
> essense, the server then has to throw away the information it has on the
> range Z - (X + Y). In case a client was interested in that information,
> that seems like a waste. I would therefore like to remove that
> requirement, by rephrasing that "sum of the *length* fields" thing into
> something along the following lines:
> 
>    In case the server returns N extents, the sum of the *length* fields
>    of the first N-1 extents MUST NOT be greater than the overall *length*
>    of the request. The final extent MAY exceed the length of the request
>    if the server has that information anyway as a side effect of looking
>    up the required information and wishes to share it.
> 
> This would then result in the fact that the "length" field in the
> BLOCK_STATUS command would be little more than a hint, since we're
> saying that a server can return more data than requested (if it's
> available anyway) and less data than requested (if it would be too
> resource-intensive to provide all the information). Not sure whether all
> that makes much sense anymore, but hey.
> 
> In addition, the combination of a server providing more information than
> requested with a "REQ_ONE" flag and a length field of zero could be an
> interesting way to enumerate a whole export, too. Essentially, we could
> define that as a client saying "I'm interested in what the size of the
> extent at offset X is, and what its properties are".
> 
> Thoughts?
>
Eric Blake May 3, 2018, 5:26 p.m. UTC | #60
[resend with updated list address and pruning addresses that bounced]

Reviving this discussion, as it still seems useful to incorporate now 
that BLOCK_STATUS is only recently part of the standard.

On 12/14/2016 11:09 AM, Wouter Verhelst wrote:

> One thing I've been thinking about that we might want to add:
> 
> There may be cases where a server, in performing the required calls to
> be able to handle a BLOCK_STATUS request, will end up with more
> information than the client asked; e.g., if the client asked information
> in the base:allocation context on an extent at offset X of length Y,
> then the server might conceivably do an lseek(SEEK_DATA) and/or
> lseek(SEEK_HOLE). The result of that call might be that the file offset
> will now point to a location Z, where Z > (X+Y).
> 
> Currently, our spec says "the sum of the *length* fields MUST not be
> greater than the overall *length* of request". This means that in
> essense, the server then has to throw away the information it has on the
> range Z - (X + Y). In case a client was interested in that information,
> that seems like a waste. I would therefore like to remove that
> requirement, by rephrasing that "sum of the *length* fields" thing into
> something along the following lines:
> 
>    In case the server returns N extents, the sum of the *length* fields
>    of the first N-1 extents MUST NOT be greater than the overall *length*
>    of the request. The final extent MAY exceed the length of the request
>    if the server has that information anyway as a side effect of looking
>    up the required information and wishes to share it.
> 
> This would then result in the fact that the "length" field in the
> BLOCK_STATUS command would be little more than a hint, since we're
> saying that a server can return more data than requested (if it's
> available anyway) and less data than requested (if it would be too
> resource-intensive to provide all the information). Not sure whether all
> that makes much sense anymore, but hey.
> 
> In addition, the combination of a server providing more information than
> requested with a "REQ_ONE" flag and a length field of zero could be an
> interesting way to enumerate a whole export, too. Essentially, we could
> define that as a client saying "I'm interested in what the size of the
> extent at offset X is, and what its properties are".
> 
> Thoughts?
>
Wouter Verhelst May 4, 2018, 6:40 a.m. UTC | #61
On Thu, May 03, 2018 at 12:26:36PM -0500, Eric Blake wrote:
> [resend with updated list address and pruning addresses that bounced]
> 
> Reviving this discussion, as it still seems useful to incorporate now that
> BLOCK_STATUS is only recently part of the standard.

Yeah. I thought we'd incorporated it already at the time, but apparently
not.

I still think this is a good idea, and that we should do it.

> On 12/14/2016 11:09 AM, Wouter Verhelst wrote:
> 
> > One thing I've been thinking about that we might want to add:
> > 
> > There may be cases where a server, in performing the required calls to
> > be able to handle a BLOCK_STATUS request, will end up with more
> > information than the client asked; e.g., if the client asked information
> > in the base:allocation context on an extent at offset X of length Y,
> > then the server might conceivably do an lseek(SEEK_DATA) and/or
> > lseek(SEEK_HOLE). The result of that call might be that the file offset
> > will now point to a location Z, where Z > (X+Y).
> > 
> > Currently, our spec says "the sum of the *length* fields MUST not be
> > greater than the overall *length* of request". This means that in
> > essense, the server then has to throw away the information it has on the
> > range Z - (X + Y). In case a client was interested in that information,
> > that seems like a waste. I would therefore like to remove that
> > requirement, by rephrasing that "sum of the *length* fields" thing into
> > something along the following lines:
> > 
> >    In case the server returns N extents, the sum of the *length* fields
> >    of the first N-1 extents MUST NOT be greater than the overall *length*
> >    of the request. The final extent MAY exceed the length of the request
> >    if the server has that information anyway as a side effect of looking
> >    up the required information and wishes to share it.
> > 
> > This would then result in the fact that the "length" field in the
> > BLOCK_STATUS command would be little more than a hint, since we're
> > saying that a server can return more data than requested (if it's
> > available anyway) and less data than requested (if it would be too
> > resource-intensive to provide all the information). Not sure whether all
> > that makes much sense anymore, but hey.
> > 
> > In addition, the combination of a server providing more information than
> > requested with a "REQ_ONE" flag and a length field of zero could be an
> > interesting way to enumerate a whole export, too. Essentially, we could
> > define that as a client saying "I'm interested in what the size of the
> > extent at offset X is, and what its properties are".
> > 
> > Thoughts?
> > 
> 
> -- 
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.           +1-919-301-3266
> Virtualization:  qemu.org | libvirt.org
>
diff mbox

Patch

diff --git a/doc/proto.md b/doc/proto.md
index e03f434..6955c76 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -683,15 +683,20 @@  by other users, so has been moved out of the "experimental" section.
 
 ## Metadata querying
 
-With the availability of sparse storage formats, it is often needed to
-query the status of a particular range and read only those blocks of
-data that are actually present on the block device.
+It is often helpful for the client to be able to query the status
+of a range of blocks. The nature of the status that can be
+queried is in part implementation dependent. For instance,
+the status might represent:
 
-Some storage formats and operations over such formats express a
-concept of data dirtiness. Whether the operation is block device
-mirroring, incremental block device backup or any other operation with
-a concept of data dirtiness, they all share a need to provide a list
-of ranges that this particular operation treats as dirty.
+* in a sparse storage format, whether the relevant blocks are
+  actually present on the backing device for the export; or
+
+* whether the relevant blocks are 'dirty'; some storage formats
+  and operations over such formats express a concept of data dirtiness.
+  Whether the operation is block device mirroring, incremental block
+  device backup or any other operation with a concept of data dirtiness,
+  they all share a need to provide a list of ranges that this
+  particular operation treats as dirty.
 
 To provide such classes of information, the NBD protocol has a generic
 framework for querying metadata; however, its use must first be
@@ -699,9 +704,11 @@  negotiated, and one or more metadata contexts must be selected.
 
 The procedure works as follows:
 
-- First, during negotiation, the client MUST select one or more metadata
+- First, during negotiation, if the client wishes to query metadata
+  during transmission, the client MUST select one or more metadata
   contexts with the `NBD_OPT_SET_META_CONTEXT` command. If needed, the client
-  can use `NBD_OPT_LIST_META_CONTEXT` to list contexts.
+  can use `NBD_OPT_LIST_META_CONTEXT` to list contexts that the server
+  supports.
 - During transmission, a client can then indicate interest in metadata
   for a given region by way of the `NBD_CMD_BLOCK_STATUS` command, where
   *offset* and *length* indicate the area of interest. The server MUST
@@ -715,13 +722,37 @@  The procedure works as follows:
   of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
 
 A client MUST NOT use `NBD_CMD_BLOCK_STATUS` unless it selected a
-nonzero number of metadata contexts during negotiation. Servers SHOULD
-reply to clients doing so anyway with `EINVAL`.
+non-zero number of metadata contexts during negotiation. Servers SHOULD
+reply to clients sending `NBD_CMD_BLOCK_STATUS without
+selecting metadata contexts with `EINVAL`.
 
-The reply to the `NBD_CMD_BLOCK_STATUS` request MUST be sent by a
+The reply to the `NBD_CMD_BLOCK_STATUS` request MUST be sent as a
 structured reply; this implies that in order to use metadata querying,
 structured replies MUST be negotiated first.
 
+Metadata contexts are identified by their names. The name MUST
+consist of a namespace, followed by a colon, followed by a leaf-name.
+The namespace and the leaf-name must each consist entirely of
+printable non-whitespace UTF-8 characters other than colons,
+and be non-empty. The entire name (namespace, colon and leaf-name)
+MUST NOT exceed 255 bytes (and therefore botht he namespace and
+leaf-name are guaranteed to be smaller than 255 bytes).
+
+Namespaces MUST be consist of one of the following:
+- `base`, for metadata contexts defined by this document;
+- `nbd-server`, for metadata contexts defined by the
+   implementation that accompanies this document (none
+   currently);
+- `x-*`, where `*` can be replaced by any random string not
+   containing colons, for local experiments. This SHOULD NOT be
+   used by metadata contexts that are expected to be widely used.
+- A third-party namespace from the list below.
+
+Third-party implementations can register additional namespaces by simple
+request to the mailing-list. The following additional third-party namespaces
+are currently registered:
+* (none)
+
 This standard defines exactly one metadata context; it is called
 `base:allocation`, and it provides information on the basic allocation
 status of extents (that is, whether they are allocated at all in a
@@ -918,9 +949,7 @@  of the newstyle negotiation.
 - `NBD_OPT_LIST_META_CONTEXT` (10)
 
     Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
-    followed by an `NBD_REP_ACK`. If a server replies to such a request
-    with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
-    commands during the transmission phase.
+    followed by an `NBD_REP_ACK`.
 
     If the query string is syntactically invalid, the server SHOULD send
     `NBD_REP_ERR_INVALID`. If the query string is syntactically valid
@@ -932,51 +961,58 @@  of the newstyle negotiation.
     SHOULD send `NBD_REP_ERR_INVALID`.
 
     Data:
-    - 32 bits, length of export name
-    - String, name of export for which we wish to list or select metadata
+    - 32 bits, length of export name.
+    - String, name of export for which we wish to list metadata
       contexts.
-    - 32 bits, length of query
-    - String, query to select a subset of the available metadata
-      contexts. If this is not specified (i.e., the "length of query"
-      field is 0 and no query is sent), then the server MUST send all
-      the metadata contexts it knows about. If specified, this query
-      string MUST start with a name that uniquely identifies a server
-      implementation; e.g., the reference implementation that
-      accompanies this document would support query strings starting
-      with 'nbd-server:'
-
-    The server MUST reply with a list of `NBD_REP_META_CONTEXT` replies,
-    followed by `NBD_REP_ACK`. The metadata context ID in these replies
-    is reserved and SHOULD be set to zero; clients SHOULD disregard it.
+    - 32 bits, number of queries
+    - Zero or more queries, each being:
+       - 32 bits, length of query.
+       - String, query to list a subset of the available metadata
+         contexts.
+
+    If zero queries are sent, then the server MUST return all
+    the metadata contexts it knows about.
+
+    For details on the query string, see under `NBD_OPT_SET_META_CONTEXT`.
+
+    The server MUST either reply with an error (for instance `EINVAL`
+    if the option is not supported), or reply with a list of
+    `NBD_REP_META_CONTEXT` replies followed by `NBD_REP_ACK`.
+    The metadata context ID in these replies is reserved and SHOULD be
+    set to zero; clients MUST disregard it.
 
 - `NBD_OPT_SET_META_CONTEXT` (11)
 
     Change the set of active metadata contexts. Issuing this command
     replaces all previously-set metadata contexts; clients must ensure
-    that all metadata contexts they're interested in are selected with
+    that all metadata contexts they are interested in are selected with
     the final query that they sent.
 
+    A client MUST NOT send `NBD_CMD_BLOCK_STATUS` unless
+    within the negotiation phase it sent `NBD_OPT_SET_META_CONTEXT`
+    at least once, and the final time it was sent, the server
+    responded without an error, and returned at least one metadata
+    context.
+
     Data:
-    - 32 bits, length of query
-    - String, query to select metadata contexts. The syntax of this
-      query is implementation-defined, except that it MUST start with a
-      namespace. This namespace may be one of the following:
-        - `base:`, for metadata contexts defined by this document;
-        - `nbd-server:`, for metadata contexts defined by the
-          implementation that accompanies this document (none
-          currently);
-        - `x-*:`, where `*` can be replaced by any random string not
-          containing colons, for local experiments. This SHOULD NOT be
-          used by metadata contexts that are expected to be widely used.
-        - third-party implementations can register additional
-          namespaces by simple request to the mailinglist.
-
-    These two fields MAY be repeated as much as is necessary to select all
-    metadata contexts the client is interested in.
+    - 32 bits, length of export name.
+    - String, name of export for which we wish to list metadata
+      contexts.
+    - 32 bits, number of queries
+    - Zero or more queries, each being:
+       - 32 bits, length of query
+       - String, query to select metadata contexts. The syntax of this
+         query is implementation-defined, except that it MUST start with a
+         namespace and a colon.
+
+    If zero queries are sent, the server MUST select no metadata contexts.
 
     The server MUST reply with a number of `NBD_REP_META_CONTEXT`
     replies, one for each selected metadata context, each with a unique
-    metadata context ID. It is not an error if a
+    metadata context ID, followed by `NBD_REP_ACK`. The metadata context
+    ID is transient and may vary across calls to `NBD_OPT_SET_META_CONTEXT`;
+    clients MUST therefore treat the ID as an opaque value and not (for
+    instance) cache it between connections. It is not an error if a
     `NBD_OPT_SET_META_CONTEXT` option does not select any metadata
     context, provided the client then does not attempt to issue
     `NBD_CMD_BLOCK_STATUS` commands.
@@ -1071,9 +1107,10 @@  The `base:allocation` metadata context is the basic "allocated at all"
 metadata context. If an extent is marked with `NBD_STATE_HOLE` at that
 context, this means that the given extent is not allocated in the
 backend storage, and that writing to the extent MAY result in the ENOSPC
-error. This supports sparse file semantics on the server side. If a
-server has only one metadata context (the default), then writing to an
-extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.
+error. This supports sparse file semantics on the server side.
+If a server supports the `base:allocation` metadata context, then writing
+to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC
+unless for reasons specified in the definition of another context.
 
 It defines the following flags for the flags field:
 
@@ -1082,20 +1119,21 @@  It defines the following flags for the flags field:
   `ENOSPC` error); if clear, the block is allocated or the server could
   not otherwise determine its status. Note that the use of
   `NBD_CMD_TRIM` is related to this status, but that the server MAY
-  report a hole even where trim has not been requested, and also that a
-  server MAY report metadata even where a trim has been requested.
+  report a hole even where `NBD_CMD_TRIM` has not been requested, and
+  also that a server MAY report that the block is allocated even where
+  `NBD_CMD_TRIM` has been requested.
 - `NBD_STATE_ZERO` (bit 1): if set, the block contents read as all
   zeroes; if clear, the block contents are not known. Note that the use
   of `NBD_CMD_WRITE_ZEROES` is related to this status, but that the
-  server MAY report zeroes even where write zeroes has not been
+  server MAY report zeroes even where `NBD_CMD_WRITE_ZEROES` has not been
   requested, and also that a server MAY report unknown content even
-  where write zeroes has been requested.
+  where `NBD_CMD_WRITE_ZEROES` has been requested.
 
 It is not an error for a server to report that a region of the
 export has both `NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear. The
-contents of such an area is undefined, and may not be stable;
-clients who are aware of the existence of such a region SHOULD NOT
-read it.
+contents of such an area are undefined, and a client
+reading such an area should make no assumption as to its contents
+or stability.
 
 For the `base:allocation` context, the remainder of the flags field is
 reserved. Servers SHOULD set it to all-zero; clients MUST ignore unknown
@@ -1150,9 +1188,10 @@  valid may depend on negotiation during the handshake phase.
    `EOVERFLOW` error chunk, if the request length is too large.
 - bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
   set, the client is interested in only one extent per metadata
-  context. If this flag is present, the server SHOULD NOT send metadata
-  on more than one extent in the reply. Clients SHOULD NOT use this flag
-  on multiple requests for successive regions in the export.
+  context. If this flag is present, the server MUST NOT send metadata
+  on more than one extent in the reply. Client implementors should note
+  that using this flag on multiple contiguous requests is likely to be
+  inefficient.
 
 ##### Structured reply flags
 
@@ -1225,7 +1264,7 @@  interpret the "length" bytes of payload.
 
     *length* MUST be 4 + (a positive integer multiple of 8).  This reply
     represents a series of consecutive block descriptors where the sum
-    of the lengths of the descriptors MUST not be greater than the
+    of the length fields within the descriptors MUST not be greater than the
     length of the original request. This chunk type MUST appear exactly
     once per metadata ID in a structured reply.
 
@@ -1236,14 +1275,15 @@  interpret the "length" bytes of payload.
     and is followed by a list of one or more descriptors, each with this
     layout:
 
-        * 32 bits, length (unsigned, MUST NOT be zero)
+        * 32 bits, length of the extent to which the status below
+          applies (unsigned, MUST be non-zero)
         * 32 bits, status flags
 
     If the client used the `NBD_CMD_FLAG_REQ_ONE` flag in the request,
     then every reply chunk MUST NOT contain more than one descriptor.
 
     Even if the client did not use the `NBD_CMD_FLAG_REQ_ONE` flag in
-    its request, the server MAY return less descriptors in the reply
+    its request, the server MAY return fewer descriptors in the reply
     than would be required to fully specify the whole range of requested
     information to the client, if looking up the information would be
     too resource-intensive for the server, so long as at least one
@@ -1462,8 +1502,13 @@  The following request types exist:
 * `NBD_CMD_BLOCK_STATUS` (7)
 
     A block status query request. Length and offset define the range of
-    interest. Clients MUST NOT use this request unless metadata
-    contexts have been negotiated, which in turn requires the client to
+    interest.
+
+    A client MUST NOT send `NBD_CMD_BLOCK_STATUS` unless
+    within the negotiation phase it sent `NBD_OPT_SET_META_CONTEXT`
+    at least once, and the final time it was sent, the server
+    responded without an error, and returned at least one metadata
+    context. This in turn requires the client to
     first negotiate structured replies. For a successful return, the
     server MUST use a structured reply, containing at least one chunk of
     type `NBD_REPLY_TYPE_BLOCK_STATUS`, where the status field of each