Message ID | 1459291701-13679-1-git-send-email-alex@alex.org.uk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 03/29/2016 04:48 PM, Alex Bligh wrote: > Here's a strawman for the structured reply section. I haven't > covered negotation. > > Also at: > https://github.com/abligh/nbd/tree/strawman-structured-reply > Markdown at: > https://github.com/abligh/nbd/blob/strawman-structured-reply/doc/proto.md > > +++ b/doc/proto.md > @@ -195,15 +195,145 @@ C: 64 bits, offset (unsigned) > C: 32 bits, length (unsigned) > C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`) > > -The server replies with: > +Replies take one of two forms. They may either be structured replies, Hmm, you put your strawman directly in the 'transmission phase' section, while mine is deferred to the 'Experimental Extensions' section, at least until we have a reference client and server implementation. > +or unstructured replies. The server MUST NOT send structured replies > +unless it has negotiated structured replies with the client using > +`NBD_OPT_STUCTURED_REPLIES` (??). > + > +[Option #1: Subject to that, the server may choose whether it sends > +any given reply to any given command as a structured reply or an > +unstructured reply.] Existing clients are not expecting a structured reply, so at least SOMETHING must be negotiated before allowing any structured reply. But seems a bit awkward to force a client to recognize two different possible replies to any given command that does not carry data. > + > +[Option #2: If this option is negotiated, the server MUST send all > +replies as structured replies. If the option is not negotiated, the > +server MUST send all replies as unstructured replies.] Doable, but slightly more traffic on the wire. In my v2, that would mean most other commands reply with NBD_REPLY_TYPE_NONE on success, or NBD_REPLY_TYPE_ERROR on failure. > + > +[Option #3: If this option is negotiated, the server MUST send all > +replies to command that support structured replies as structured > +replies (currently `NBD_CMD_READ` only), and all other replies as > +unstructured replies. If the option is not negotiated, the server MUST > +send all replies as unstructured replies.] My v2 went with this approach, for all existing commands, but documented that future commands (such as the proposed NBD_CMD_GET_LBA_STATUS, or whatever we name it) may be structured-only. I also gave the server latitude to reject clients that negotiate a future structured-reply command without also negotiating structured reads, which I guess would help if we want to go with option #2 instead of #3. > +#### Structured replies > + > +A structured reply consists of one or more chunks. The server > +MUST send exactly one end chunk (identified by > +the chunk type `NBD_CHUNKTYPE_END`), and this MUST be the final > +chunk within the reply. My v2 went with a flag marking the end packet, rather than a specific type. I actually ended up with 5 defined reply types, where any of them could be flagged as the end packet (but where some of them are useful only as the end packet). > + > +Each chunk consists of the following: > + > +S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`) I see you cribbed the number in my original proposal; I got my number from bash's $RANDOM; I don't know if you have any underlying scheme for other magic numbers in use (such as consecutive HEX digits from pi, or something with ASCII meaning, or something in l33t-speak), so I'm fine if anyone has rationale for why a better magic number would be desirable. > +S: 32 bits, flags (including type) > +S: 64 bits, handle > +S: 32 bits, payload length > +S: (*length* bytes of payload data) > + > +The flags have the following meanings: > + > +* bits 0-7: `NBD_CHUNKTYPE`, an 8 bit unsigned integer > +* bits 8-31: reserved (server MUST set these to zero) My v2 separates the flag and type fields. > + > +Possible values of `NBD_CHUNKTYPE` are as follows: > + > +* 0 = `NBD_CHUNKTYPE_END`: the final chunk > +* 1 = `NBD_CHUNKTYPE_DATA`: data that has been read > +* 2 = `NBD_CHUNKTYPE_ZERO`: data that should be considered zero > + > +The format of the payload data for each chunk type is as follows: > + > +##### `NBD_CHUNKTYPE_END` > + > +S: 32 bits, error code or zero for success > +S: 64 bits, offset of error (if any) > + > +##### `NBD_CHUNKTYPE_DATA` > + > +S: 64 bits, offset of data > +S: (*length-8* bytes of data as read) > + > +##### `NBD_CHUNKTYPE_ZERO` > + > +S: 64 bits, offset of data > +S: 32 bits, number of zeroes to which this corresponds Structure wise, our reply types look similar, even though the naming is different. I also had two more types (no data, used for success, and a distinct error without offset type, rather than special-casing -1 as indeterminate offset). > > -Replies need not be sent in the same order as requests (i.e., requests > -may be handled by the server asynchronously). > +The server MUST NOT send chunks that overlap. The server > +MUST NOT send chunks whose data exceeds the length > +of data requested (for this purpose counting the data > +within `NBD_CHUNKTYPE_ZERO` as the number of zero bytes > +specified therein). The server MUST, in the case of a successesful successful > +read send exactly the number of bytes requested (whether > +represented by `NBD_CHUNKTYPE_DATA` or `NBD_CHUNKTYPE_ZERO`). > +The server MUST NOT, in the case of an errored read, send > +more than the number of bytes requested. > + > +In order to avoid the burden of reassembly, the client > +MAY send `NBD_CMD_FLAG_DF`, which instructs the server > +not to fragment the reply. If this flag is set, the server > +MUST send either zero or one data chunks and an `NBD_CHUNKTYPE_END` > +only. Under such circumstances the server MAY error the command > +with `ETOOBIG` if the length read exceeds [65,536 bytes | the `ETOOBIG` is not a standard error; my v2 went with the POSIX EOVERFLOW and defined it to it's Linux value of 75. > + > +If more than one data chunk containing an error has been transmitted > +prior to sending the `NBD_CHUNKTYPE_END`, the server MUST take > +the second option above, to avoid the client assuming that data > +chunks which do not contain the offset marked as errored are > +error-free. My v2 lets the server send multiple error-with-offset packets in the reply chain.
On 30 Mar 2016, at 00:17, Eric Blake <eblake@redhat.com> wrote: >> >> -The server replies with: >> +Replies take one of two forms. They may either be structured replies, > > Hmm, you put your strawman directly in the 'transmission phase' section, > while mine is deferred to the 'Experimental Extensions' section, at > least until we have a reference client and server implementation. Yeah, my wording was straw man but I think it should go into the main body of the work. Obviously in that case it wouldn't be *merged* until we had a working implementation. The SELECT stuff is a bit different as I am not sure it was intended to be standardized short term (i.e. it was a longer term experiment IIRC). I guess Wouter should be the arbiter of whether he'd prefer to merge it only when we have an implementation. My concern is that it may hang around in 'experimental', when it needs properly merging, which will require re-wordsmithing. >> +or unstructured replies. The server MUST NOT send structured replies >> +unless it has negotiated structured replies with the client using >> +`NBD_OPT_STUCTURED_REPLIES` (??). >> + >> +[Option #1: Subject to that, the server may choose whether it sends >> +any given reply to any given command as a structured reply or an >> +unstructured reply.] > > Existing clients are not expecting a structured reply, so at least > SOMETHING must be negotiated before allowing any structured reply. Agree, hence the 'subject to that' and the previous sentence. > But > seems a bit awkward to force a client to recognize two different > possible replies to any given command that does not carry data. A client that supports both types of reply will have routines to process both anyway. It just means that it must have both available for both types of reply. >> + >> +[Option #2: If this option is negotiated, the server MUST send all >> +replies as structured replies. If the option is not negotiated, the >> +server MUST send all replies as unstructured replies.] > > Doable, but slightly more traffic on the wire. In my v2, that would > mean most other commands reply with NBD_REPLY_TYPE_NONE on success, or > NBD_REPLY_TYPE_ERROR on failure. I think this is now my preference. With your NBD_REPLY_TYPE_ERROR thing without the error, we are talking 4 bytes of difference I think. >> + >> +[Option #3: If this option is negotiated, the server MUST send all >> +replies to command that support structured replies as structured >> +replies (currently `NBD_CMD_READ` only), and all other replies as >> +unstructured replies. If the option is not negotiated, the server MUST >> +send all replies as unstructured replies.] > > My v2 went with this approach, for all existing commands, but documented > that future commands (such as the proposed NBD_CMD_GET_LBA_STATUS, or > whatever we name it) may be structured-only. > > I also gave the server latitude to reject clients that negotiate a > future structured-reply command without also negotiating structured > reads, which I guess would help if we want to go with option #2 instead > of #3. > >> +#### Structured replies >> + >> +A structured reply consists of one or more chunks. The server >> +MUST send exactly one end chunk (identified by >> +the chunk type `NBD_CHUNKTYPE_END`), and this MUST be the final >> +chunk within the reply. > > My v2 went with a flag marking the end packet, rather than a specific > type. I actually ended up with 5 defined reply types, where any of them > could be flagged as the end packet (but where some of them are useful > only as the end packet). I think your idea is better here. >> + >> +Each chunk consists of the following: >> + >> +S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`) > > I see you cribbed the number in my original proposal; I got my number > from bash's $RANDOM; I don't know if you have any underlying scheme for > other magic numbers in use (such as consecutive HEX digits from pi, or > something with ASCII meaning, or something in l33t-speak), so I'm fine > if anyone has rationale for why a better magic number would be desirable. Yes I did, and no idea! >> +S: 32 bits, flags (including type) >> +S: 64 bits, handle >> +S: 32 bits, payload length >> +S: (*length* bytes of payload data) >> + >> +The flags have the following meanings: >> + >> +* bits 0-7: `NBD_CHUNKTYPE`, an 8 bit unsigned integer >> +* bits 8-31: reserved (server MUST set these to zero) > > My v2 separates the flag and type fields. Yeah I proposed that later on list. I see you went with 16/16. All fine by me. >> + >> +Possible values of `NBD_CHUNKTYPE` are as follows: >> + >> +* 0 = `NBD_CHUNKTYPE_END`: the final chunk >> +* 1 = `NBD_CHUNKTYPE_DATA`: data that has been read >> +* 2 = `NBD_CHUNKTYPE_ZERO`: data that should be considered zero >> + >> +The format of the payload data for each chunk type is as follows: >> + >> +##### `NBD_CHUNKTYPE_END` >> + >> +S: 32 bits, error code or zero for success >> +S: 64 bits, offset of error (if any) >> + >> +##### `NBD_CHUNKTYPE_DATA` >> + >> +S: 64 bits, offset of data >> +S: (*length-8* bytes of data as read) >> + >> +##### `NBD_CHUNKTYPE_ZERO` >> + >> +S: 64 bits, offset of data >> +S: 32 bits, number of zeroes to which this corresponds > > Structure wise, our reply types look similar, even though the naming is > different. I also had two more types (no data, used for success, and a > distinct error without offset type, rather than special-casing -1 as > indeterminate offset). Yours is better. >> -Replies need not be sent in the same order as requests (i.e., requests >> -may be handled by the server asynchronously). >> +The server MUST NOT send chunks that overlap. The server >> +MUST NOT send chunks whose data exceeds the length >> +of data requested (for this purpose counting the data >> +within `NBD_CHUNKTYPE_ZERO` as the number of zero bytes >> +specified therein). The server MUST, in the case of a successesful > > successful > >> +read send exactly the number of bytes requested (whether >> +represented by `NBD_CHUNKTYPE_DATA` or `NBD_CHUNKTYPE_ZERO`). >> +The server MUST NOT, in the case of an errored read, send >> +more than the number of bytes requested. >> + >> +In order to avoid the burden of reassembly, the client >> +MAY send `NBD_CMD_FLAG_DF`, which instructs the server >> +not to fragment the reply. If this flag is set, the server >> +MUST send either zero or one data chunks and an `NBD_CHUNKTYPE_END` >> +only. Under such circumstances the server MAY error the command >> +with `ETOOBIG` if the length read exceeds [65,536 bytes | the > > `ETOOBIG` is not a standard error; my v2 went with the POSIX EOVERFLOW > and defined it to it's Linux value of 75. Yours is better. >> + >> +If more than one data chunk containing an error has been transmitted >> +prior to sending the `NBD_CHUNKTYPE_END`, the server MUST take >> +the second option above, to avoid the client assuming that data >> +chunks which do not contain the offset marked as errored are >> +error-free. > > My v2 lets the server send multiple error-with-offset packets in the > reply chain. This is also better. -- Alex Bligh
Morning, On Wed, Mar 30, 2016 at 07:59:15AM +0100, Alex Bligh wrote: > On 30 Mar 2016, at 00:17, Eric Blake <eblake@redhat.com> wrote: > >> > >> -The server replies with: > >> +Replies take one of two forms. They may either be structured replies, > > > > Hmm, you put your strawman directly in the 'transmission phase' section, > > while mine is deferred to the 'Experimental Extensions' section, at > > least until we have a reference client and server implementation. > > Yeah, my wording was straw man but I think it should go into the main > body of the work. Obviously in that case it wouldn't be *merged* until > we had a working implementation. > > The SELECT stuff is a bit different as I am not sure it was intended > to be standardized short term (i.e. it was a longer term experiment > IIRC). Actually, I plan to implement that when I get around to doing STARTTLS (which I've started work on, but is far from ready). > I guess Wouter should be the arbiter of whether he'd prefer to merge > it only when we have an implementation. My concern is that it may > hang around in 'experimental', when it needs properly merging, which > will require re-wordsmithing. I'll merge whatever the outcome of this discussion is in the experimental section; that way, it won't get lost. I can reformulate your text to fit there myself if needs be, although obviously having something that doesn't require such work is preferable. However, I'm leaning more towards Eric's proposal at this time, because it feels more mature. [...] > >> +Each chunk consists of the following: > >> + > >> +S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`) > > > > I see you cribbed the number in my original proposal; I got my number > > from bash's $RANDOM; I don't know if you have any underlying scheme for > > other magic numbers in use (such as consecutive HEX digits from pi, or > > something with ASCII meaning, or something in l33t-speak), so I'm fine > > if anyone has rationale for why a better magic number would be desirable. > > Yes I did, and no idea! The oldstyle magic ('cliserv_magic') and the request/reply magic numbers were made by Pavel, who invented NBD. I have no idea how he chose his numbers, either. The newstyle magic ('opts_magic') is ASCII for IHAVEOPT. It is used as the magic number for option haggling, too. There are no further magic numbers in the NBD protocol. [...] > >> + > >> +Possible values of `NBD_CHUNKTYPE` are as follows: > >> + > >> +* 0 = `NBD_CHUNKTYPE_END`: the final chunk > >> +* 1 = `NBD_CHUNKTYPE_DATA`: data that has been read > >> +* 2 = `NBD_CHUNKTYPE_ZERO`: data that should be considered zero > >> + > >> +The format of the payload data for each chunk type is as follows: > >> + > >> +##### `NBD_CHUNKTYPE_END` > >> + > >> +S: 32 bits, error code or zero for success > >> +S: 64 bits, offset of error (if any) > >> + > >> +##### `NBD_CHUNKTYPE_DATA` > >> + > >> +S: 64 bits, offset of data > >> +S: (*length-8* bytes of data as read) > >> + > >> +##### `NBD_CHUNKTYPE_ZERO` > >> + > >> +S: 64 bits, offset of data > >> +S: 32 bits, number of zeroes to which this corresponds > > > > Structure wise, our reply types look similar, even though the naming is > > different. I also had two more types (no data, used for success, and a > > distinct error without offset type, rather than special-casing -1 as > > indeterminate offset). > > Yours is better. I now realize, after having slept over it, that you guys probably meant for an error-with-offset to only mark bytes that were part of *that* particular reply chunk to be marked as faulty, which makes more sense than "the whole request range from that point on", as I was interpreting it... [...]
On 03/30/2016 01:33 AM, Wouter Verhelst wrote: > Morning, > > On Wed, Mar 30, 2016 at 07:59:15AM +0100, Alex Bligh wrote: >> On 30 Mar 2016, at 00:17, Eric Blake <eblake@redhat.com> wrote: >>>> >>>> -The server replies with: >>>> +Replies take one of two forms. They may either be structured replies, >>> >>> Hmm, you put your strawman directly in the 'transmission phase' section, >>> while mine is deferred to the 'Experimental Extensions' section, at >>> least until we have a reference client and server implementation. >> >> Yeah, my wording was straw man but I think it should go into the main >> body of the work. Obviously in that case it wouldn't be *merged* until >> we had a working implementation. >> >> The SELECT stuff is a bit different as I am not sure it was intended >> to be standardized short term (i.e. it was a longer term experiment >> IIRC). > > Actually, I plan to implement that when I get around to doing STARTTLS > (which I've started work on, but is far from ready). On that tangent, I found SELECT slightly ambiguous (particularly since I'm also considering a proposal to modify NBD_REP_SERVER to expose alignment details, so it would have to play nicely with SELECT): Based on normative text alone, we would have: S: 64 bits, 0x3e889045565a9 S: 32 bits, NBD_OPT_SELECT (6) S: 32 bits, NBD_REP_SERVER (2) S: 32 bits, length S: 32 bits, name length S: 'name length' bytes, UTF-8 encoded name (is it NUL-terminated?) S: 'length - name length' bytes, which is UTF-8 encoded message to display to human This implies that 'name length' <= 'length - 4', although we don't actually state that the server MUST NOT send a name length longer than that. It might not hurt to also require that length include space for a NUL terminator (in both the name, and in the optional human-readable information field), but only if that matches existing practice (if it does not, then we should document that the client and server are NOT dealing with NUL-terminated UTF-8 strings, but relying on length instead). The SELECT text then refines this: S: 64 bits, 0x3e889045565a9 S: 32 bits, NBD_OPT_SELECT (6) S: 32 bits, NBD_REP_SERVER (2) S: 32 bits, length S: 32 bits, name length S: 'name length' bytes, UTF-8 encoded name (is it NUL-terminated?) S: 64 bits, size S: 16 bits, export flags but doesn't mention what happens if 'length' > 'name length + 14'. Presumably, if the server wants to include the same UTF-8 human information as before, it would go AFTER the 16 bits for the export field (in other words, the SELECT extension is carving out fields in the MIDDLE of the payload). So, once I know for sure about the handling of NUL bytes, I'll probably try my hand at a patch to clarify the wording in both the normative and in the SELECT extension. >> I guess Wouter should be the arbiter of whether he'd prefer to merge >> it only when we have an implementation. My concern is that it may >> hang around in 'experimental', when it needs properly merging, which >> will require re-wordsmithing. > > I'll merge whatever the outcome of this discussion is in the > experimental section; that way, it won't get lost. I can reformulate > your text to fit there myself if needs be, although obviously having > something that doesn't require such work is preferable. However, I'm > leaning more towards Eric's proposal at this time, because it feels more > mature. Okay, I'll do my best to word things in the experimental section so that we can minimize edits when moving text. Maybe I'll even go so far as to post a 2-patch series; one with the text in the experimental section for committing now; the other for moving the text (but NOT to be committed) to the normative section, for demonstrating the level of effort required. > > I now realize, after having slept over it, that you guys probably meant > for an error-with-offset to only mark bytes that were part of *that* > particular reply chunk to be marked as faulty, which makes more sense > than "the whole request range from that point on", as I was interpreting > it... Indeed, anything I can do to make the wording more clear is welcome. Obviously, a server can mark an entire data chunk invalid by setting the offset to the same value as the offset used in that chunk; the real power is that an offset that is > chunk-offset but < 'chunk-offset + chunk-length' then lets the client know that the first half of chunk was valid, the second half of chunk is bogus, and no other chunk is impacted. And when a lazy server sends error-without-offset, the client is forced to treat ALL chunks as invalid; which is why I state that the server SHOULD NOT mix error-with-offset and error-without-offset in the same reply (SHOULD NOT, but not MUST NOT, because I can't predict every possible error scenario, and there may be some fatal chain of events where the server is forced to mix after all).
On 03/30/2016 12:43 PM, Eric Blake wrote: > On that tangent, I found SELECT slightly ambiguous (particularly since > I'm also considering a proposal to modify NBD_REP_SERVER to expose > alignment details, so it would have to play nicely with SELECT): > > Based on normative text alone, we would have: > > S: 64 bits, 0x3e889045565a9 > S: 32 bits, NBD_OPT_SELECT (6) > S: 32 bits, NBD_REP_SERVER (2) > S: 32 bits, length > S: 32 bits, name length > S: 'name length' bytes, UTF-8 encoded name (is it NUL-terminated?) > S: 'length - name length' bytes, which is UTF-8 encoded message to > display to human > > This implies that 'name length' <= 'length - 4', although we don't > actually state that the server MUST NOT send a name length longer than > that. It might not hurt to also require that length include space for a > NUL terminator (in both the name, and in the optional human-readable > information field), but only if that matches existing practice (if it > does not, then we should document that the client and server are NOT > dealing with NUL-terminated UTF-8 strings, but relying on length instead). > > The SELECT text then refines this: > > S: 64 bits, 0x3e889045565a9 > S: 32 bits, NBD_OPT_SELECT (6) > S: 32 bits, NBD_REP_SERVER (2) > S: 32 bits, length > S: 32 bits, name length > S: 'name length' bytes, UTF-8 encoded name (is it NUL-terminated?) > S: 64 bits, size > S: 16 bits, export flags > > but doesn't mention what happens if 'length' > 'name length + 14'. > Presumably, if the server wants to include the same UTF-8 human > information as before, it would go AFTER the 16 bits for the export > field (in other words, the SELECT extension is carving out fields in the > MIDDLE of the payload). Actually, now that I've thought about it, I wonder if it is better to arrange data so that there is _always_ a UTF-8 human-readable string immediately after the name (with 1-byte content of "" as the minimum), so that clients can just blindly print that string, even if binary data appears later in the structure due to other things (such as NBD_OPT_SELECT) specifying the layout of that binary data. But if we go with that approach, then we should probably document that NBD_REP_SERVER sent in reply to NBD_OPT_SELECT must have length >= 'name-length + 15'. Anywhere that we put binary data where a client used to be expecting human-readable data risks an unprepared client printing garbage. > > So, once I know for sure about the handling of NUL bytes, I'll probably > try my hand at a patch to clarify the wording in both the normative and > in the SELECT extension. At any rate, I hope that I've brought attention to the issue, and that part of moving SELECT out of experimental and into normative will involve documenting decisions you actually make while implementing things.
On Wed, Mar 30, 2016 at 12:43:57PM -0600, Eric Blake wrote: > On 03/30/2016 01:33 AM, Wouter Verhelst wrote: > > Morning, > > > > On Wed, Mar 30, 2016 at 07:59:15AM +0100, Alex Bligh wrote: > >> On 30 Mar 2016, at 00:17, Eric Blake <eblake@redhat.com> wrote: > >>>> > >>>> -The server replies with: > >>>> +Replies take one of two forms. They may either be structured replies, > >>> > >>> Hmm, you put your strawman directly in the 'transmission phase' section, > >>> while mine is deferred to the 'Experimental Extensions' section, at > >>> least until we have a reference client and server implementation. > >> > >> Yeah, my wording was straw man but I think it should go into the main > >> body of the work. Obviously in that case it wouldn't be *merged* until > >> we had a working implementation. > >> > >> The SELECT stuff is a bit different as I am not sure it was intended > >> to be standardized short term (i.e. it was a longer term experiment > >> IIRC). > > > > Actually, I plan to implement that when I get around to doing STARTTLS > > (which I've started work on, but is far from ready). > > On that tangent, I found SELECT slightly ambiguous (particularly since > I'm also considering a proposal to modify NBD_REP_SERVER to expose > alignment details, so it would have to play nicely with SELECT): > > Based on normative text alone, we would have: > > S: 64 bits, 0x3e889045565a9 > S: 32 bits, NBD_OPT_SELECT (6) > S: 32 bits, NBD_REP_SERVER (2) > S: 32 bits, length > S: 32 bits, name length > S: 'name length' bytes, UTF-8 encoded name (is it NUL-terminated?) > S: 'length - name length' bytes, which is UTF-8 encoded message to > display to human Formally, it defines that as "the rest of the data contains some undefined implementation-specific details about the export (...) [i]f the client did not explicitly request otherwise, these details are defined to be UTF-8 encoded data suitable for direct display to a human being." > This implies that 'name length' <= 'length - 4', although we don't > actually state that the server MUST NOT send a name length longer than > that. It might not hurt to also require that length include space for a > NUL terminator (in both the name, and in the optional human-readable > information field), but only if that matches existing practice (if it > does not, then we should document that the client and server are NOT > dealing with NUL-terminated UTF-8 strings, but relying on length instead). > > The SELECT text then refines this: > > S: 64 bits, 0x3e889045565a9 > S: 32 bits, NBD_OPT_SELECT (6) > S: 32 bits, NBD_REP_SERVER (2) > S: 32 bits, length > S: 32 bits, name length > S: 'name length' bytes, UTF-8 encoded name (is it NUL-terminated?) > S: 64 bits, size > S: 16 bits, export flags > > but doesn't mention what happens if 'length' > 'name length + 14'. The idea here was that the client in case of the SELECT extension *does* "explicitly request otherwise", meaning, the details about the export here are no longer undefined or implementation-specific, but instead clearly defined to be the "size" plus the "export flags". > Presumably, if the server wants to include the same UTF-8 human > information as before, it would go AFTER the 16 bits for the export > field (in other words, the SELECT extension is carving out fields in the > MIDDLE of the payload). Sortof. I can see how you come to that conclusion, but it wasn't meant as such. Then again, I'll readily admit that I didn't spend quite as much thinking/discussing about the SELECT extension as compared to the discussion we're having now :-) > So, once I know for sure about the handling of NUL bytes, I'll probably > try my hand at a patch to clarify the wording in both the normative and > in the SELECT extension. nbd-server.c does this: for(i=0; i<servers->len; i++) { SERVER* serve = &(g_array_index(servers, SERVER, i)); len = htonl(strlen(serve->servename)); memcpy(buf, &len, sizeof(len)); strcpy(ptr, serve->servename); send_reply(opt, net, NBD_REP_SERVER, strlen(serve->servename)+sizeof(len), buf); } where ptr = buf + sizeof(len). (on a sidenote, that should really be a strncpy. Let me fix that... there, done) so it doesn't send the NUL byte. The nbd-client side ends up adding a NUL byte after whatever the server sent; I think that makes the most sense, since a client should do so anyway for data it receives from a (potentially evil) peer. [...] > > I now realize, after having slept over it, that you guys probably meant > > for an error-with-offset to only mark bytes that were part of *that* > > particular reply chunk to be marked as faulty, which makes more sense > > than "the whole request range from that point on", as I was interpreting > > it... > > Indeed, anything I can do to make the wording more clear is welcome. You should probably be explicit about which parts of the response are marked as invalid when an error is sent. Also, what happens if a server which checks read()'s return value before forming a header encounters an error? It wouldn't send the broken data to the client (it knows it's broken), it wouldn't send any padding, so it would send an error code to the client about an offset that it will never send? > Obviously, a server can mark an entire data chunk invalid by setting the > offset to the same value as the offset used in that chunk; the real > power is that an offset that is > chunk-offset but < 'chunk-offset + > chunk-length' then lets the client know that the first half of chunk was > valid, the second half of chunk is bogus, and no other chunk is > impacted. Right. That seems like it's close to a clear spec :-) > And when a lazy server sends error-without-offset, the client is > forced to treat ALL chunks as invalid; which is why I state that the > server SHOULD NOT mix error-with-offset and error-without-offset in > the same reply (SHOULD NOT, but not MUST NOT, because I can't predict > every possible error scenario, and there may be some fatal chain of > events where the server is forced to mix after all). "someone just ate my filesystem," for instance :-)
diff --git a/doc/proto.md b/doc/proto.md index aaae0a2..8c89b1e 100644 --- a/doc/proto.md +++ b/doc/proto.md @@ -195,15 +195,145 @@ C: 64 bits, offset (unsigned) C: 32 bits, length (unsigned) C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`) -The server replies with: +Replies take one of two forms. They may either be structured replies, +or unstructured replies. The server MUST NOT send structured replies +unless it has negotiated structured replies with the client using +`NBD_OPT_STUCTURED_REPLIES` (??). + +[Option #1: Subject to that, the server may choose whether it sends +any given reply to any given command as a structured reply or an +unstructured reply.] + +[Option #2: If this option is negotiated, the server MUST send all +replies as structured replies. If the option is not negotiated, the +server MUST send all replies as unstructured replies.] + +[Option #3: If this option is negotiated, the server MUST send all +replies to command that support structured replies as structured +replies (currently `NBD_CMD_READ` only), and all other replies as +unstructured replies. If the option is not negotiated, the server MUST +send all replies as unstructured replies.] + +Unstructured replies are problematic for error handling within +`NBD_CMD_READ`, therefore servers SHOULD support structured replies. + +#### Unstructured replies + +In an unstructured reply, the server replies with: S: 32 bits, 0x67446698, magic (`NBD_REPLY_MAGIC`) S: 32 bits, error S: 64 bits, handle -S: (*length* bytes of data if the request is of type `NBD_CMD_READ`) +S: (*length* bytes of data if the request is of type `NBD_CMD_READ`) + +#### Structured replies + +A structured reply consists of one or more chunks. The server +MUST send exactly one end chunk (identified by +the chunk type `NBD_CHUNKTYPE_END`), and this MUST be the final +chunk within the reply. + +Each chunk consists of the following: + +S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`) +S: 32 bits, flags (including type) +S: 64 bits, handle +S: 32 bits, payload length +S: (*length* bytes of payload data) + +The flags have the following meanings: + +* bits 0-7: `NBD_CHUNKTYPE`, an 8 bit unsigned integer +* bits 8-31: reserved (server MUST set these to zero) + +Possible values of `NBD_CHUNKTYPE` are as follows: + +* 0 = `NBD_CHUNKTYPE_END`: the final chunk +* 1 = `NBD_CHUNKTYPE_DATA`: data that has been read +* 2 = `NBD_CHUNKTYPE_ZERO`: data that should be considered zero + +The format of the payload data for each chunk type is as follows: + +##### `NBD_CHUNKTYPE_END` + +S: 32 bits, error code or zero for success +S: 64 bits, offset of error (if any) + +##### `NBD_CHUNKTYPE_DATA` + +S: 64 bits, offset of data +S: (*length-8* bytes of data as read) + +##### `NBD_CHUNKTYPE_ZERO` + +S: 64 bits, offset of data +S: 32 bits, number of zeroes to which this corresponds + + +Commands that return data (currently `NBD_CMD_READ`) therefore MUST +return zero or more chunks each of type `NBD_CHUNKTYPE_DATA` or +`NBD_CHUNKTYPE_ZERO` (collectively 'data chunks') followed +an `NBD_CHUNKTYPE_END`. + +The server MAY split the reply into any non-zero number of data +chunks (provided each consists of at least one byte) and +MAY send the data chunks in any order (though the +`NBD_CHUNKTYPE_END` must be the final chunk). This means the +client is responsible for reassembling the chunks in the correct +order. -Replies need not be sent in the same order as requests (i.e., requests -may be handled by the server asynchronously). +The server MUST NOT send chunks that overlap. The server +MUST NOT send chunks whose data exceeds the length +of data requested (for this purpose counting the data +within `NBD_CHUNKTYPE_ZERO` as the number of zero bytes +specified therein). The server MUST, in the case of a successesful +read send exactly the number of bytes requested (whether +represented by `NBD_CHUNKTYPE_DATA` or `NBD_CHUNKTYPE_ZERO`). +The server MUST NOT, in the case of an errored read, send +more than the number of bytes requested. + +In order to avoid the burden of reassembly, the client +MAY send `NBD_CMD_FLAG_DF`, which instructs the server +not to fragment the reply. If this flag is set, the server +MUST send either zero or one data chunks and an `NBD_CHUNKTYPE_END` +only. Under such circumstances the server MAY error the command +with `ETOOBIG` if the length read exceeds [65,536 bytes | the +negotiated maximum fragment size]. + +If no errors are detected within an operation, the `NBD_CHUNKTYPE_END` +packet MUST contain an error value of zero and an error offset of +zero. + +If the server detects an error during an operation which it +is serving with a structured reply, it MUST complete +the transmission of the errored data chunk(s) if transmission +has started (by padding the chunk(s) concerned with data +which MUST be zero), after which zero or more further +data chunks may be sent, followed by an `NBD_CHUNKTYPE_END` +chunk. The server MUST either: + +* set the offset within `NBD_CHUNKTYPE_END` to the offset of the + error, in which case this MUST be within the length requested; or +* set the offset to 0xffffffff meaning the error occurred at + an unidentified place. + +If more than one data chunk containing an error has been transmitted +prior to sending the `NBD_CHUNKTYPE_END`, the server MUST take +the second option above, to avoid the client assuming that data +chunks which do not contain the offset marked as errored are +error-free. + +#### Ordering of replies + +The server MAY send replies in any order. The order of replies +need not correpsond to the order of requests, i.e., requests +may be handled by the server asynchronously). The server MAY +interleave the chunks relating to a single structured reply +with chunks relating to structured replies relating to +a different handle, or with unstructured replies relating +to a different handle. Note that there is a constraint on +the ordering of chunks within a given structured reply as set out +above; this is a separate issue to the ordering of replies. ## Values
Here's a strawman for the structured reply section. I haven't covered negotation. Also at: https://github.com/abligh/nbd/tree/strawman-structured-reply Markdown at: https://github.com/abligh/nbd/blob/strawman-structured-reply/doc/proto.md Changes since v1: * Fixed some formatting * Inserted 3 options for when structured vs unstructured replies are used * Made use of error location clearer. 0xffffffff indicates 'unknown error position'. * Minor clarifications Signed-off-by: Alex Bligh <alex@alex.org.uk> --- doc/proto.md | 138 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 134 insertions(+), 4 deletions(-)