diff mbox

[PATCHv2] Strawman proposal for NBD structured replies

Message ID 1459291701-13679-1-git-send-email-alex@alex.org.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Alex Bligh March 29, 2016, 10:48 p.m. UTC
Here's a strawman for the structured reply section. I haven't
covered negotation.

Also at:
  https://github.com/abligh/nbd/tree/strawman-structured-reply
Markdown at:
  https://github.com/abligh/nbd/blob/strawman-structured-reply/doc/proto.md

Changes since v1:

* Fixed some formatting

* Inserted 3 options for when structured vs unstructured
  replies are used

* Made use of error location clearer. 0xffffffff indicates
  'unknown error position'.

* Minor clarifications

Signed-off-by: Alex Bligh <alex@alex.org.uk>
---
 doc/proto.md | 138 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 134 insertions(+), 4 deletions(-)

Comments

Eric Blake March 29, 2016, 11:17 p.m. UTC | #1
On 03/29/2016 04:48 PM, Alex Bligh wrote:
> Here's a strawman for the structured reply section. I haven't
> covered negotation.
> 
> Also at:
>   https://github.com/abligh/nbd/tree/strawman-structured-reply
> Markdown at:
>   https://github.com/abligh/nbd/blob/strawman-structured-reply/doc/proto.md
> 

> +++ b/doc/proto.md
> @@ -195,15 +195,145 @@ C: 64 bits, offset (unsigned)
>  C: 32 bits, length (unsigned)  
>  C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)
>  
> -The server replies with:
> +Replies take one of two forms. They may either be structured replies,

Hmm, you put your strawman directly in the 'transmission phase' section,
while mine is deferred to the 'Experimental Extensions' section, at
least until we have a reference client and server implementation.

> +or unstructured replies. The server MUST NOT send structured replies
> +unless it has negotiated structured replies with the client using
> +`NBD_OPT_STUCTURED_REPLIES` (??).
> +
> +[Option #1: Subject to that, the server may choose whether it sends
> +any given reply to any given command as a structured reply or an
> +unstructured reply.]

Existing clients are not expecting a structured reply, so at least
SOMETHING must be negotiated before allowing any structured reply.  But
seems a bit awkward to force a client to recognize two different
possible replies to any given command that does not carry data.

> +
> +[Option #2: If this option is negotiated, the server MUST send all
> +replies as structured replies. If the option is not negotiated, the
> +server MUST send all replies as unstructured replies.]

Doable, but slightly more traffic on the wire.  In my v2, that would
mean most other commands reply with NBD_REPLY_TYPE_NONE on success, or
NBD_REPLY_TYPE_ERROR on failure.

> +
> +[Option #3: If this option is negotiated, the server MUST send all
> +replies to command that support structured replies as structured
> +replies (currently `NBD_CMD_READ` only), and all other replies as
> +unstructured replies. If the option is not negotiated, the server MUST
> +send all replies as unstructured replies.]

My v2 went with this approach, for all existing commands, but documented
that future commands (such as the proposed NBD_CMD_GET_LBA_STATUS, or
whatever we name it) may be structured-only.

I also gave the server latitude to reject clients that negotiate a
future structured-reply command without also negotiating structured
reads, which I guess would help if we want to go with option #2 instead
of #3.

> +#### Structured replies
> +
> +A structured reply consists of one or more chunks. The server
> +MUST send exactly one end chunk (identified by
> +the chunk type `NBD_CHUNKTYPE_END`), and this MUST be the final
> +chunk within the reply.

My v2 went with a flag marking the end packet, rather than a specific
type.  I actually ended up with 5 defined reply types, where any of them
could be flagged as the end packet (but where some of them are useful
only as the end packet).

> +
> +Each chunk consists of the following:
> +
> +S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)  

I see you cribbed the number in my original proposal; I got my number
from bash's $RANDOM; I don't know if you have any underlying scheme for
other magic numbers in use (such as consecutive HEX digits from pi, or
something with ASCII meaning, or something in l33t-speak), so I'm fine
if anyone has rationale for why a better magic number would be desirable.

> +S: 32 bits, flags (including type)  
> +S: 64 bits, handle  
> +S: 32 bits, payload length  
> +S: (*length* bytes of payload data)  
> +
> +The flags have the following meanings:
> +
> +* bits 0-7: `NBD_CHUNKTYPE`, an 8 bit unsigned integer
> +* bits 8-31: reserved (server MUST set these to zero)

My v2 separates the flag and type fields.

> +
> +Possible values of `NBD_CHUNKTYPE` are as follows:
> +
> +* 0 = `NBD_CHUNKTYPE_END`: the final chunk
> +* 1 = `NBD_CHUNKTYPE_DATA`: data that has been read
> +* 2 = `NBD_CHUNKTYPE_ZERO`: data that should be considered zero
> +
> +The format of the payload data for each chunk type is as follows:
> +
> +##### `NBD_CHUNKTYPE_END`
> +
> +S: 32 bits, error code or zero for success  
> +S: 64 bits, offset of error (if any)  
> +
> +##### `NBD_CHUNKTYPE_DATA`
> +
> +S: 64 bits, offset of data  
> +S: (*length-8* bytes of data as read)  
> +
> +##### `NBD_CHUNKTYPE_ZERO`
> +
> +S: 64 bits, offset of data  
> +S: 32 bits, number of zeroes to which this corresponds  

Structure wise, our reply types look similar, even though the naming is
different.  I also had two more types (no data, used for success, and a
distinct error without offset type, rather than special-casing -1 as
indeterminate offset).


>  
> -Replies need not be sent in the same order as requests (i.e., requests
> -may be handled by the server asynchronously).
> +The server MUST NOT send chunks that overlap. The server
> +MUST NOT send chunks whose data exceeds the length
> +of data requested (for this purpose counting the data
> +within `NBD_CHUNKTYPE_ZERO` as the number of zero bytes
> +specified therein). The server MUST, in the case of a successesful

successful

> +read send exactly the number of bytes requested (whether
> +represented by `NBD_CHUNKTYPE_DATA` or `NBD_CHUNKTYPE_ZERO`).
> +The server MUST NOT, in the case of an errored read, send
> +more than the number of bytes requested.
> +
> +In order to avoid the burden of reassembly, the client
> +MAY send `NBD_CMD_FLAG_DF`, which instructs the server
> +not to fragment the reply. If this flag is set, the server
> +MUST send either zero or one data chunks and an `NBD_CHUNKTYPE_END`
> +only. Under such circumstances the server MAY error the command
> +with `ETOOBIG` if the length read exceeds [65,536 bytes | the

`ETOOBIG` is not a standard error; my v2 went with the POSIX EOVERFLOW
and defined it to it's Linux value of 75.

> +
> +If more than one data chunk containing an error has been transmitted
> +prior to sending the `NBD_CHUNKTYPE_END`, the server MUST take
> +the second option above, to avoid the client assuming that data
> +chunks which do not contain the offset marked as errored are
> +error-free.

My v2 lets the server send multiple error-with-offset packets in the
reply chain.
Alex Bligh March 30, 2016, 6:59 a.m. UTC | #2
On 30 Mar 2016, at 00:17, Eric Blake <eblake@redhat.com> wrote:
>> 
>> -The server replies with:
>> +Replies take one of two forms. They may either be structured replies,
> 
> Hmm, you put your strawman directly in the 'transmission phase' section,
> while mine is deferred to the 'Experimental Extensions' section, at
> least until we have a reference client and server implementation.

Yeah, my wording was straw man but I think it should go into the main
body of the work. Obviously in that case it wouldn't be *merged* until
we had a working implementation.

The SELECT stuff is a bit different as I am not sure it was intended
to be standardized short term (i.e. it was a longer term experiment
IIRC).

I guess Wouter should be the arbiter of whether he'd prefer to merge
it only when we have an implementation. My concern is that it may
hang around in 'experimental', when it needs properly merging, which
will require re-wordsmithing.

>> +or unstructured replies. The server MUST NOT send structured replies
>> +unless it has negotiated structured replies with the client using
>> +`NBD_OPT_STUCTURED_REPLIES` (??).
>> +
>> +[Option #1: Subject to that, the server may choose whether it sends
>> +any given reply to any given command as a structured reply or an
>> +unstructured reply.]
> 
> Existing clients are not expecting a structured reply, so at least
> SOMETHING must be negotiated before allowing any structured reply.

Agree, hence the 'subject to that' and the previous sentence.

>  But
> seems a bit awkward to force a client to recognize two different
> possible replies to any given command that does not carry data.

A client that supports both types of reply will have routines to
process both anyway. It just means that it must have both available
for both types of reply.

>> +
>> +[Option #2: If this option is negotiated, the server MUST send all
>> +replies as structured replies. If the option is not negotiated, the
>> +server MUST send all replies as unstructured replies.]
> 
> Doable, but slightly more traffic on the wire.  In my v2, that would
> mean most other commands reply with NBD_REPLY_TYPE_NONE on success, or
> NBD_REPLY_TYPE_ERROR on failure.

I think this is now my preference. With your NBD_REPLY_TYPE_ERROR
thing without the error, we are talking 4 bytes of difference I
think.

>> +
>> +[Option #3: If this option is negotiated, the server MUST send all
>> +replies to command that support structured replies as structured
>> +replies (currently `NBD_CMD_READ` only), and all other replies as
>> +unstructured replies. If the option is not negotiated, the server MUST
>> +send all replies as unstructured replies.]
> 
> My v2 went with this approach, for all existing commands, but documented
> that future commands (such as the proposed NBD_CMD_GET_LBA_STATUS, or
> whatever we name it) may be structured-only.
> 
> I also gave the server latitude to reject clients that negotiate a
> future structured-reply command without also negotiating structured
> reads, which I guess would help if we want to go with option #2 instead
> of #3.
> 
>> +#### Structured replies
>> +
>> +A structured reply consists of one or more chunks. The server
>> +MUST send exactly one end chunk (identified by
>> +the chunk type `NBD_CHUNKTYPE_END`), and this MUST be the final
>> +chunk within the reply.
> 
> My v2 went with a flag marking the end packet, rather than a specific
> type.  I actually ended up with 5 defined reply types, where any of them
> could be flagged as the end packet (but where some of them are useful
> only as the end packet).

I think your idea is better here.

>> +
>> +Each chunk consists of the following:
>> +
>> +S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)
> 
> I see you cribbed the number in my original proposal; I got my number
> from bash's $RANDOM; I don't know if you have any underlying scheme for
> other magic numbers in use (such as consecutive HEX digits from pi, or
> something with ASCII meaning, or something in l33t-speak), so I'm fine
> if anyone has rationale for why a better magic number would be desirable.

Yes I did, and no idea!

>> +S: 32 bits, flags (including type)
>> +S: 64 bits, handle
>> +S: 32 bits, payload length
>> +S: (*length* bytes of payload data)
>> +
>> +The flags have the following meanings:
>> +
>> +* bits 0-7: `NBD_CHUNKTYPE`, an 8 bit unsigned integer
>> +* bits 8-31: reserved (server MUST set these to zero)
> 
> My v2 separates the flag and type fields.

Yeah I proposed that later on list. I see you went with 16/16. All
fine by me.

>> +
>> +Possible values of `NBD_CHUNKTYPE` are as follows:
>> +
>> +* 0 = `NBD_CHUNKTYPE_END`: the final chunk
>> +* 1 = `NBD_CHUNKTYPE_DATA`: data that has been read
>> +* 2 = `NBD_CHUNKTYPE_ZERO`: data that should be considered zero
>> +
>> +The format of the payload data for each chunk type is as follows:
>> +
>> +##### `NBD_CHUNKTYPE_END`
>> +
>> +S: 32 bits, error code or zero for success
>> +S: 64 bits, offset of error (if any)
>> +
>> +##### `NBD_CHUNKTYPE_DATA`
>> +
>> +S: 64 bits, offset of data
>> +S: (*length-8* bytes of data as read)
>> +
>> +##### `NBD_CHUNKTYPE_ZERO`
>> +
>> +S: 64 bits, offset of data
>> +S: 32 bits, number of zeroes to which this corresponds
> 
> Structure wise, our reply types look similar, even though the naming is
> different.  I also had two more types (no data, used for success, and a
> distinct error without offset type, rather than special-casing -1 as
> indeterminate offset).

Yours is better.

>> -Replies need not be sent in the same order as requests (i.e., requests
>> -may be handled by the server asynchronously).
>> +The server MUST NOT send chunks that overlap. The server
>> +MUST NOT send chunks whose data exceeds the length
>> +of data requested (for this purpose counting the data
>> +within `NBD_CHUNKTYPE_ZERO` as the number of zero bytes
>> +specified therein). The server MUST, in the case of a successesful
> 
> successful
> 
>> +read send exactly the number of bytes requested (whether
>> +represented by `NBD_CHUNKTYPE_DATA` or `NBD_CHUNKTYPE_ZERO`).
>> +The server MUST NOT, in the case of an errored read, send
>> +more than the number of bytes requested.
>> +
>> +In order to avoid the burden of reassembly, the client
>> +MAY send `NBD_CMD_FLAG_DF`, which instructs the server
>> +not to fragment the reply. If this flag is set, the server
>> +MUST send either zero or one data chunks and an `NBD_CHUNKTYPE_END`
>> +only. Under such circumstances the server MAY error the command
>> +with `ETOOBIG` if the length read exceeds [65,536 bytes | the
> 
> `ETOOBIG` is not a standard error; my v2 went with the POSIX EOVERFLOW
> and defined it to it's Linux value of 75.

Yours is better.

>> +
>> +If more than one data chunk containing an error has been transmitted
>> +prior to sending the `NBD_CHUNKTYPE_END`, the server MUST take
>> +the second option above, to avoid the client assuming that data
>> +chunks which do not contain the offset marked as errored are
>> +error-free.
> 
> My v2 lets the server send multiple error-with-offset packets in the
> reply chain.

This is also better.

--
Alex Bligh
Wouter Verhelst March 30, 2016, 7:33 a.m. UTC | #3
Morning,

On Wed, Mar 30, 2016 at 07:59:15AM +0100, Alex Bligh wrote:
> On 30 Mar 2016, at 00:17, Eric Blake <eblake@redhat.com> wrote:
> >> 
> >> -The server replies with:
> >> +Replies take one of two forms. They may either be structured replies,
> > 
> > Hmm, you put your strawman directly in the 'transmission phase' section,
> > while mine is deferred to the 'Experimental Extensions' section, at
> > least until we have a reference client and server implementation.
> 
> Yeah, my wording was straw man but I think it should go into the main
> body of the work. Obviously in that case it wouldn't be *merged* until
> we had a working implementation.
> 
> The SELECT stuff is a bit different as I am not sure it was intended
> to be standardized short term (i.e. it was a longer term experiment
> IIRC).

Actually, I plan to implement that when I get around to doing STARTTLS
(which I've started work on, but is far from ready).

> I guess Wouter should be the arbiter of whether he'd prefer to merge
> it only when we have an implementation. My concern is that it may
> hang around in 'experimental', when it needs properly merging, which
> will require re-wordsmithing.

I'll merge whatever the outcome of this discussion is in the
experimental section; that way, it won't get lost. I can reformulate
your text to fit there myself if needs be, although obviously having
something that doesn't require such work is preferable. However, I'm
leaning more towards Eric's proposal at this time, because it feels more
mature.

[...]
> >> +Each chunk consists of the following:
> >> +
> >> +S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)
> > 
> > I see you cribbed the number in my original proposal; I got my number
> > from bash's $RANDOM; I don't know if you have any underlying scheme for
> > other magic numbers in use (such as consecutive HEX digits from pi, or
> > something with ASCII meaning, or something in l33t-speak), so I'm fine
> > if anyone has rationale for why a better magic number would be desirable.
> 
> Yes I did, and no idea!

The oldstyle magic ('cliserv_magic') and the request/reply magic numbers
were made by Pavel, who invented NBD. I have no idea how he chose his
numbers, either.

The newstyle magic ('opts_magic') is ASCII for IHAVEOPT. It is used as
the magic number for option haggling, too.

There are no further magic numbers in the NBD protocol.

[...]
> >> +
> >> +Possible values of `NBD_CHUNKTYPE` are as follows:
> >> +
> >> +* 0 = `NBD_CHUNKTYPE_END`: the final chunk
> >> +* 1 = `NBD_CHUNKTYPE_DATA`: data that has been read
> >> +* 2 = `NBD_CHUNKTYPE_ZERO`: data that should be considered zero
> >> +
> >> +The format of the payload data for each chunk type is as follows:
> >> +
> >> +##### `NBD_CHUNKTYPE_END`
> >> +
> >> +S: 32 bits, error code or zero for success
> >> +S: 64 bits, offset of error (if any)
> >> +
> >> +##### `NBD_CHUNKTYPE_DATA`
> >> +
> >> +S: 64 bits, offset of data
> >> +S: (*length-8* bytes of data as read)
> >> +
> >> +##### `NBD_CHUNKTYPE_ZERO`
> >> +
> >> +S: 64 bits, offset of data
> >> +S: 32 bits, number of zeroes to which this corresponds
> > 
> > Structure wise, our reply types look similar, even though the naming is
> > different.  I also had two more types (no data, used for success, and a
> > distinct error without offset type, rather than special-casing -1 as
> > indeterminate offset).
> 
> Yours is better.

I now realize, after having slept over it, that you guys probably meant
for an error-with-offset to only mark bytes that were part of *that*
particular reply chunk to be marked as faulty, which makes more sense
than "the whole request range from that point on", as I was interpreting
it...

[...]
Eric Blake March 30, 2016, 6:43 p.m. UTC | #4
On 03/30/2016 01:33 AM, Wouter Verhelst wrote:
> Morning,
> 
> On Wed, Mar 30, 2016 at 07:59:15AM +0100, Alex Bligh wrote:
>> On 30 Mar 2016, at 00:17, Eric Blake <eblake@redhat.com> wrote:
>>>>
>>>> -The server replies with:
>>>> +Replies take one of two forms. They may either be structured replies,
>>>
>>> Hmm, you put your strawman directly in the 'transmission phase' section,
>>> while mine is deferred to the 'Experimental Extensions' section, at
>>> least until we have a reference client and server implementation.
>>
>> Yeah, my wording was straw man but I think it should go into the main
>> body of the work. Obviously in that case it wouldn't be *merged* until
>> we had a working implementation.
>>
>> The SELECT stuff is a bit different as I am not sure it was intended
>> to be standardized short term (i.e. it was a longer term experiment
>> IIRC).
> 
> Actually, I plan to implement that when I get around to doing STARTTLS
> (which I've started work on, but is far from ready).

On that tangent, I found SELECT slightly ambiguous (particularly since
I'm also considering a proposal to modify NBD_REP_SERVER to expose
alignment details, so it would have to play nicely with SELECT):

Based on normative text alone, we would have:

S: 64 bits, 0x3e889045565a9
S: 32 bits, NBD_OPT_SELECT (6)
S: 32 bits, NBD_REP_SERVER (2)
S: 32 bits, length
S: 32 bits, name length
S: 'name length' bytes, UTF-8 encoded name (is it NUL-terminated?)
S: 'length - name length' bytes, which is UTF-8 encoded message to
display to human

This implies that 'name length' <= 'length - 4', although we don't
actually state that the server MUST NOT send a name length longer than
that.  It might not hurt to also require that length include space for a
NUL terminator (in both the name, and in the optional human-readable
information field), but only if that matches existing practice (if it
does not, then we should document that the client and server are NOT
dealing with NUL-terminated UTF-8 strings, but relying on length instead).

The SELECT text then refines this:

S: 64 bits, 0x3e889045565a9
S: 32 bits, NBD_OPT_SELECT (6)
S: 32 bits, NBD_REP_SERVER (2)
S: 32 bits, length
S: 32 bits, name length
S: 'name length' bytes, UTF-8 encoded name (is it NUL-terminated?)
S: 64 bits, size
S: 16 bits, export flags

but doesn't mention what happens if 'length' > 'name length + 14'.
Presumably, if the server wants to include the same UTF-8 human
information as before, it would go AFTER the 16 bits for the export
field (in other words, the SELECT extension is carving out fields in the
MIDDLE of the payload).

So, once I know for sure about the handling of NUL bytes, I'll probably
try my hand at a patch to clarify the wording in both the normative and
in the SELECT extension.

>> I guess Wouter should be the arbiter of whether he'd prefer to merge
>> it only when we have an implementation. My concern is that it may
>> hang around in 'experimental', when it needs properly merging, which
>> will require re-wordsmithing.
> 
> I'll merge whatever the outcome of this discussion is in the
> experimental section; that way, it won't get lost. I can reformulate
> your text to fit there myself if needs be, although obviously having
> something that doesn't require such work is preferable. However, I'm
> leaning more towards Eric's proposal at this time, because it feels more
> mature.

Okay, I'll do my best to word things in the experimental section so that
we can minimize edits when moving text. Maybe I'll even go so far as to
post a 2-patch series; one with the text in the experimental section for
committing now; the other for moving the text (but NOT to be committed)
to the normative section, for demonstrating the level of effort required.


> 
> I now realize, after having slept over it, that you guys probably meant
> for an error-with-offset to only mark bytes that were part of *that*
> particular reply chunk to be marked as faulty, which makes more sense
> than "the whole request range from that point on", as I was interpreting
> it...

Indeed, anything I can do to make the wording more clear is welcome.
Obviously, a server can mark an entire data chunk invalid by setting the
offset to the same value as the offset used in that chunk; the real
power is that an offset that is > chunk-offset but < 'chunk-offset +
chunk-length' then lets the client know that the first half of chunk was
valid, the second half of chunk is bogus, and no other chunk is
impacted.  And when a lazy server sends error-without-offset, the client
is forced to treat ALL chunks as invalid; which is why I state that the
server SHOULD NOT mix error-with-offset and error-without-offset in the
same reply (SHOULD NOT, but not MUST NOT, because I can't predict every
possible error scenario, and there may be some fatal chain of events
where the server is forced to mix after all).
Eric Blake March 30, 2016, 7:10 p.m. UTC | #5
On 03/30/2016 12:43 PM, Eric Blake wrote:

> On that tangent, I found SELECT slightly ambiguous (particularly since
> I'm also considering a proposal to modify NBD_REP_SERVER to expose
> alignment details, so it would have to play nicely with SELECT):
> 
> Based on normative text alone, we would have:
> 
> S: 64 bits, 0x3e889045565a9
> S: 32 bits, NBD_OPT_SELECT (6)
> S: 32 bits, NBD_REP_SERVER (2)
> S: 32 bits, length
> S: 32 bits, name length
> S: 'name length' bytes, UTF-8 encoded name (is it NUL-terminated?)
> S: 'length - name length' bytes, which is UTF-8 encoded message to
> display to human
> 
> This implies that 'name length' <= 'length - 4', although we don't
> actually state that the server MUST NOT send a name length longer than
> that.  It might not hurt to also require that length include space for a
> NUL terminator (in both the name, and in the optional human-readable
> information field), but only if that matches existing practice (if it
> does not, then we should document that the client and server are NOT
> dealing with NUL-terminated UTF-8 strings, but relying on length instead).
> 
> The SELECT text then refines this:
> 
> S: 64 bits, 0x3e889045565a9
> S: 32 bits, NBD_OPT_SELECT (6)
> S: 32 bits, NBD_REP_SERVER (2)
> S: 32 bits, length
> S: 32 bits, name length
> S: 'name length' bytes, UTF-8 encoded name (is it NUL-terminated?)
> S: 64 bits, size
> S: 16 bits, export flags
> 
> but doesn't mention what happens if 'length' > 'name length + 14'.
> Presumably, if the server wants to include the same UTF-8 human
> information as before, it would go AFTER the 16 bits for the export
> field (in other words, the SELECT extension is carving out fields in the
> MIDDLE of the payload).

Actually, now that I've thought about it, I wonder if it is better to
arrange data so that there is _always_ a UTF-8 human-readable string
immediately after the name (with 1-byte content of "" as the minimum),
so that clients can just blindly print that string, even if binary data
appears later in the structure due to other things (such as
NBD_OPT_SELECT) specifying the layout of that binary data.  But if we go
with that approach, then we should probably document that NBD_REP_SERVER
sent in reply to NBD_OPT_SELECT must have length >= 'name-length + 15'.
 Anywhere that we put binary data where a client used to be expecting
human-readable data risks an unprepared client printing garbage.

> 
> So, once I know for sure about the handling of NUL bytes, I'll probably
> try my hand at a patch to clarify the wording in both the normative and
> in the SELECT extension.

At any rate, I hope that I've brought attention to the issue, and that
part of moving SELECT out of experimental and into normative will
involve documenting decisions you actually make while implementing things.
Wouter Verhelst March 30, 2016, 8:12 p.m. UTC | #6
On Wed, Mar 30, 2016 at 12:43:57PM -0600, Eric Blake wrote:
> On 03/30/2016 01:33 AM, Wouter Verhelst wrote:
> > Morning,
> > 
> > On Wed, Mar 30, 2016 at 07:59:15AM +0100, Alex Bligh wrote:
> >> On 30 Mar 2016, at 00:17, Eric Blake <eblake@redhat.com> wrote:
> >>>>
> >>>> -The server replies with:
> >>>> +Replies take one of two forms. They may either be structured replies,
> >>>
> >>> Hmm, you put your strawman directly in the 'transmission phase' section,
> >>> while mine is deferred to the 'Experimental Extensions' section, at
> >>> least until we have a reference client and server implementation.
> >>
> >> Yeah, my wording was straw man but I think it should go into the main
> >> body of the work. Obviously in that case it wouldn't be *merged* until
> >> we had a working implementation.
> >>
> >> The SELECT stuff is a bit different as I am not sure it was intended
> >> to be standardized short term (i.e. it was a longer term experiment
> >> IIRC).
> > 
> > Actually, I plan to implement that when I get around to doing STARTTLS
> > (which I've started work on, but is far from ready).
> 
> On that tangent, I found SELECT slightly ambiguous (particularly since
> I'm also considering a proposal to modify NBD_REP_SERVER to expose
> alignment details, so it would have to play nicely with SELECT):
> 
> Based on normative text alone, we would have:
> 
> S: 64 bits, 0x3e889045565a9
> S: 32 bits, NBD_OPT_SELECT (6)
> S: 32 bits, NBD_REP_SERVER (2)
> S: 32 bits, length
> S: 32 bits, name length
> S: 'name length' bytes, UTF-8 encoded name (is it NUL-terminated?)
> S: 'length - name length' bytes, which is UTF-8 encoded message to
> display to human

Formally, it defines that as "the rest of the data contains some
undefined implementation-specific details about the export (...) [i]f
the client did not explicitly request otherwise, these details are
defined to be UTF-8 encoded data suitable for direct display to a human
being."

> This implies that 'name length' <= 'length - 4', although we don't
> actually state that the server MUST NOT send a name length longer than
> that.  It might not hurt to also require that length include space for a
> NUL terminator (in both the name, and in the optional human-readable
> information field), but only if that matches existing practice (if it
> does not, then we should document that the client and server are NOT
> dealing with NUL-terminated UTF-8 strings, but relying on length instead).
> 
> The SELECT text then refines this:
> 
> S: 64 bits, 0x3e889045565a9
> S: 32 bits, NBD_OPT_SELECT (6)
> S: 32 bits, NBD_REP_SERVER (2)
> S: 32 bits, length
> S: 32 bits, name length
> S: 'name length' bytes, UTF-8 encoded name (is it NUL-terminated?)
> S: 64 bits, size
> S: 16 bits, export flags
> 
> but doesn't mention what happens if 'length' > 'name length + 14'.

The idea here was that the client in case of the SELECT extension *does*
"explicitly request otherwise", meaning, the details about the export
here are no longer undefined or implementation-specific, but instead
clearly defined to be the "size" plus the "export flags".

> Presumably, if the server wants to include the same UTF-8 human
> information as before, it would go AFTER the 16 bits for the export
> field (in other words, the SELECT extension is carving out fields in the
> MIDDLE of the payload).

Sortof. I can see how you come to that conclusion, but it wasn't meant
as such. Then again, I'll readily admit that I didn't spend quite as
much thinking/discussing about the SELECT extension as compared to the
discussion we're having now :-)

> So, once I know for sure about the handling of NUL bytes, I'll probably
> try my hand at a patch to clarify the wording in both the normative and
> in the SELECT extension.

nbd-server.c does this:

        for(i=0; i<servers->len; i++) {
                SERVER* serve = &(g_array_index(servers, SERVER, i));
                len = htonl(strlen(serve->servename));
                memcpy(buf, &len, sizeof(len));
                strcpy(ptr, serve->servename);
                send_reply(opt, net, NBD_REP_SERVER, strlen(serve->servename)+sizeof(len), buf);
        }

where ptr = buf + sizeof(len).

(on a sidenote, that should really be a strncpy. Let me fix that...
there, done)

so it doesn't send the NUL byte. The nbd-client side ends up adding a
NUL byte after whatever the server sent; I think that makes the most
sense, since a client should do so anyway for data it receives from a
(potentially evil) peer.

[...]
> > I now realize, after having slept over it, that you guys probably meant
> > for an error-with-offset to only mark bytes that were part of *that*
> > particular reply chunk to be marked as faulty, which makes more sense
> > than "the whole request range from that point on", as I was interpreting
> > it...
> 
> Indeed, anything I can do to make the wording more clear is welcome.

You should probably be explicit about which parts of the response are
marked as invalid when an error is sent.

Also, what happens if a server which checks read()'s return value before
forming a header encounters an error? It wouldn't send the broken data
to the client (it knows it's broken), it wouldn't send any padding, so
it would send an error code to the client about an offset that it will
never send?

> Obviously, a server can mark an entire data chunk invalid by setting the
> offset to the same value as the offset used in that chunk; the real
> power is that an offset that is > chunk-offset but < 'chunk-offset +
> chunk-length' then lets the client know that the first half of chunk was
> valid, the second half of chunk is bogus, and no other chunk is
> impacted.

Right. That seems like it's close to a clear spec :-)

> And when a lazy server sends error-without-offset, the client is
> forced to treat ALL chunks as invalid; which is why I state that the
> server SHOULD NOT mix error-with-offset and error-without-offset in
> the same reply (SHOULD NOT, but not MUST NOT, because I can't predict
> every possible error scenario, and there may be some fatal chain of
> events where the server is forced to mix after all).

"someone just ate my filesystem," for instance :-)
diff mbox

Patch

diff --git a/doc/proto.md b/doc/proto.md
index aaae0a2..8c89b1e 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -195,15 +195,145 @@  C: 64 bits, offset (unsigned)
 C: 32 bits, length (unsigned)  
 C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)
 
-The server replies with:
+Replies take one of two forms. They may either be structured replies,
+or unstructured replies. The server MUST NOT send structured replies
+unless it has negotiated structured replies with the client using
+`NBD_OPT_STUCTURED_REPLIES` (??).
+
+[Option #1: Subject to that, the server may choose whether it sends
+any given reply to any given command as a structured reply or an
+unstructured reply.]
+
+[Option #2: If this option is negotiated, the server MUST send all
+replies as structured replies. If the option is not negotiated, the
+server MUST send all replies as unstructured replies.]
+
+[Option #3: If this option is negotiated, the server MUST send all
+replies to command that support structured replies as structured
+replies (currently `NBD_CMD_READ` only), and all other replies as
+unstructured replies. If the option is not negotiated, the server MUST
+send all replies as unstructured replies.]
+
+Unstructured replies are problematic for error handling within
+`NBD_CMD_READ`, therefore servers SHOULD support structured replies.
+
+#### Unstructured replies
+
+In an unstructured reply, the server replies with:
 
 S: 32 bits, 0x67446698, magic (`NBD_REPLY_MAGIC`)  
 S: 32 bits, error  
 S: 64 bits, handle  
-S: (*length* bytes of data if the request is of type `NBD_CMD_READ`)
+S: (*length* bytes of data if the request is of type `NBD_CMD_READ`)  
+
+#### Structured replies
+
+A structured reply consists of one or more chunks. The server
+MUST send exactly one end chunk (identified by
+the chunk type `NBD_CHUNKTYPE_END`), and this MUST be the final
+chunk within the reply.
+
+Each chunk consists of the following:
+
+S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)  
+S: 32 bits, flags (including type)  
+S: 64 bits, handle  
+S: 32 bits, payload length  
+S: (*length* bytes of payload data)  
+
+The flags have the following meanings:
+
+* bits 0-7: `NBD_CHUNKTYPE`, an 8 bit unsigned integer
+* bits 8-31: reserved (server MUST set these to zero)
+
+Possible values of `NBD_CHUNKTYPE` are as follows:
+
+* 0 = `NBD_CHUNKTYPE_END`: the final chunk
+* 1 = `NBD_CHUNKTYPE_DATA`: data that has been read
+* 2 = `NBD_CHUNKTYPE_ZERO`: data that should be considered zero
+
+The format of the payload data for each chunk type is as follows:
+
+##### `NBD_CHUNKTYPE_END`
+
+S: 32 bits, error code or zero for success  
+S: 64 bits, offset of error (if any)  
+
+##### `NBD_CHUNKTYPE_DATA`
+
+S: 64 bits, offset of data  
+S: (*length-8* bytes of data as read)  
+
+##### `NBD_CHUNKTYPE_ZERO`
+
+S: 64 bits, offset of data  
+S: 32 bits, number of zeroes to which this corresponds  
+
+
+Commands that return data (currently `NBD_CMD_READ`) therefore MUST
+return zero or more chunks each of type `NBD_CHUNKTYPE_DATA` or
+`NBD_CHUNKTYPE_ZERO` (collectively 'data chunks') followed
+an `NBD_CHUNKTYPE_END`.
+
+The server MAY split the reply into any non-zero number of data
+chunks (provided each consists of at least one byte) and
+MAY send the data chunks in any order (though the
+`NBD_CHUNKTYPE_END` must be the final chunk). This means the
+client is responsible for reassembling the chunks in the correct
+order.
 
-Replies need not be sent in the same order as requests (i.e., requests
-may be handled by the server asynchronously).
+The server MUST NOT send chunks that overlap. The server
+MUST NOT send chunks whose data exceeds the length
+of data requested (for this purpose counting the data
+within `NBD_CHUNKTYPE_ZERO` as the number of zero bytes
+specified therein). The server MUST, in the case of a successesful
+read send exactly the number of bytes requested (whether
+represented by `NBD_CHUNKTYPE_DATA` or `NBD_CHUNKTYPE_ZERO`).
+The server MUST NOT, in the case of an errored read, send
+more than the number of bytes requested.
+
+In order to avoid the burden of reassembly, the client
+MAY send `NBD_CMD_FLAG_DF`, which instructs the server
+not to fragment the reply. If this flag is set, the server
+MUST send either zero or one data chunks and an `NBD_CHUNKTYPE_END`
+only. Under such circumstances the server MAY error the command
+with `ETOOBIG` if the length read exceeds [65,536 bytes | the
+negotiated maximum fragment size].
+
+If no errors are detected within an operation, the `NBD_CHUNKTYPE_END`
+packet MUST contain an error value of zero and an error offset of
+zero.
+
+If the server detects an error during an operation which it
+is serving with a structured reply, it MUST complete
+the transmission of the errored data chunk(s) if transmission
+has started (by padding the chunk(s) concerned with data
+which MUST be zero), after which zero or more further
+data chunks may be sent, followed by an `NBD_CHUNKTYPE_END`
+chunk. The server MUST either:
+
+* set the offset within `NBD_CHUNKTYPE_END` to the offset of the
+  error, in which case this MUST be within the length requested; or
+* set the offset to 0xffffffff meaning the error occurred at
+  an unidentified place.
+
+If more than one data chunk containing an error has been transmitted
+prior to sending the `NBD_CHUNKTYPE_END`, the server MUST take
+the second option above, to avoid the client assuming that data
+chunks which do not contain the offset marked as errored are
+error-free.
+
+#### Ordering of replies
+
+The server MAY send replies in any order. The order of replies
+need not correpsond to the order of requests, i.e., requests
+may be handled by the server asynchronously). The server MAY
+interleave the chunks relating to a single structured reply
+with chunks relating to structured replies relating to
+a different handle, or with unstructured replies relating
+to a different handle. Note that there is a constraint on
+the ordering of chunks within a given structured reply as set out
+above; this is a separate issue to the ordering of replies.
 
 ## Values