Message ID | 20161205234235.5728-1-eblake@redhat.com (mailing list archive)
---|---
State | New, archived
> On 5 Dec 2016, at 23:42, Eric Blake <eblake@redhat.com> wrote:
>
> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
> team discovered that it is useful if a server can advertise
> whether an export is in a known-all-zeroes state at the time
> the client connects.

I think this is good to go, and ...

> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  doc/proto.md | 5 +++++
>  1 file changed, 5 insertions(+)
>
> This replaces the following qemu patch attempt:
> https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00357.html
> which tried to add NBD_CMD_HAS_ZERO_INIT with poor semantics. The
> semantics in this proposal should be much better.
>
> Patch is to the merge of the master branch and the
> extension-write-zeroes branch. By the way, qemu 2.8 is due
> to be released "real soon now", and implements NBD_CMD_WRITE_ZEROES,
> so maybe it is time to consider promoting the extension-write-zeroes
> branch into master.

I would support this.

In fact the patch is sufficiently simple I think I'd merge this
into extension-write-zeroes then merge that into master.

Wouter?

Alex

> diff --git a/doc/proto.md b/doc/proto.md
> index afe71fc..7e4ec7f 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -697,6 +697,11 @@ The field has the following format:
>      the export.
>  - bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`: defined by the experimental
>    `BLOCK_STATUS` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-blockstatus/doc/proto.md).
> +- bit 10, `NBD_FLAG_INIT_ZEROES`: Indicates that the server guarantees
> +  that at the time transmission phase begins, all offsets within the
> +  export read as zero bytes. Clients MAY use this information to
> +  avoid writing to sections of the export that should still read as
> +  zero after the client is done writing.
>
>  Clients SHOULD ignore unknown flags.
>
> --
> 2.9.3
>
> ------------------------------------------------------------------------------
> Developer Access Program for Intel Xeon Phi Processors
> Access to Intel Xeon Phi processor-based developer platforms.
> With one year of Intel Parallel Studio XE.
> Training and support from Colfax.
> Order your platform today. http://sdm.link/xeonphi
> _______________________________________________
> Nbd-general mailing list
> Nbd-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nbd-general
Am 06.12.2016 um 00:42 hat Eric Blake geschrieben:
> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
> team discovered that it is useful if a server can advertise
> whether an export is in a known-all-zeroes state at the time
> the client connects.

Does a server usually have the information to set this flag, other than
querying the block status of all blocks at startup? If so, the client
could just query this by itself.

The patch that was originally sent to qemu-devel just forwarded qemu's
.bdrv_has_zero_init() call to the server. However, what this function
returns is not a known-all-zeroes state on open, but just a
known-all-zeroes state immediately after bdrv_create(), i.e. creating a
new image. Then it becomes information that is easy to get and doesn't
involve querying all blocks (e.g. true for COW image formats, true for
raw on regular files, false for raw on block devices).

This is useful for 'qemu-img convert', which creates an image and then
writes the whole contents, but I'm not sure if this property is
applicable for NBD, which I think doesn't even have a create operation.

Kevin
On Mon, Dec 05, 2016 at 05:42:35PM -0600, Eric Blake wrote:
> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
> team discovered that it is useful if a server can advertise
> whether an export is in a known-all-zeroes state at the time
> the client connects.
>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  doc/proto.md | 5 +++++
>  1 file changed, 5 insertions(+)
>
> This replaces the following qemu patch attempt:
> https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00357.html
> which tried to add NBD_CMD_HAS_ZERO_INIT with poor semantics. The
> semantics in this proposal should be much better.
>
> Patch is to the merge of the master branch and the
> extension-write-zeroes branch. By the way, qemu 2.8 is due
> to be released "real soon now", and implements NBD_CMD_WRITE_ZEROES,
> so maybe it is time to consider promoting the extension-write-zeroes
> branch into master.

Useful, thanks!

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> On 6 Dec 2016, at 09:25, Kevin Wolf <kwolf@redhat.com> wrote:
>
> Am 06.12.2016 um 00:42 hat Eric Blake geschrieben:
>> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
>> team discovered that it is useful if a server can advertise
>> whether an export is in a known-all-zeroes state at the time
>> the client connects.
>
> Does a server usually have the information to set this flag, other than
> querying the block status of all blocks at startup?

The server may have other ways of knowing this, for instance that it has
just created the file (*), or that it stat'd the file before opening it
(not unlikely) and noticed it had 0 allocated size. The latter I suspect
would be trivial to implement in nbd-server.

(*) = e.g. I had one application where nbd used the export path to
signify it wanted to open a temporary file, the path consisting of a
UUID and an encoded length. If the file was not present already it
created it with ftruncate(). That could trivially have used this.

> If so, the client could just query this by itself.

Well, there's no currently mainlined extension to do that, but yes it
could. On the other hand I see no issue passing complete zero status
back to the client if it's so obvious from a stat().
> On 6 Dec 2016, at 08:46, Alex Bligh <alex@alex.org.uk> wrote:
>
> I would support this.
>
> In fact the patch is sufficiently simple I think I'd merge this
> into extension-write-zeroes then merge that into master.

Hence:

Reviewed-By: Alex Bligh <alex@alex.org.uk>
On 12/06/2016 03:25 AM, Kevin Wolf wrote:
> Am 06.12.2016 um 00:42 hat Eric Blake geschrieben:
>> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
>> team discovered that it is useful if a server can advertise
>> whether an export is in a known-all-zeroes state at the time
>> the client connects.
>
> Does a server usually have the information to set this flag, other than
> querying the block status of all blocks at startup? If so, the client
> could just query this by itself.

Well, only if the client can query information at all (we don't have the
documentation finished for extent queries, let alone a reference
implementation).

> The patch that was originally sent to qemu-devel just forwarded qemu's
> .bdrv_has_zero_init() call to the server. However, what this function
> returns is not a known-all-zeroes state on open, but just a
> known-all-zeroes state immediately after bdrv_create(), i.e. creating a
> new image. Then it becomes information that is easy to get and doesn't
> involve querying all blocks (e.g. true for COW image formats, true for
> raw on regular files, false for raw on block devices).

Just because the NBD spec describes the bit does NOT require that
servers HAVE to set the bit on all images that are all zeroes. It is
perfectly compliant if the server never advertises the bit.

That said, I think there are cases where qemu can easily advertise the
bit.

I _do_ agree that it is NOT as trivial as the qemu server just
forwarding the value of .bdrv_has_zero_init() - the server HAS to prove
that no data has been written to the image. But for a qcow2 image just
created with qemu-img, it is a fairly easy proof: if the L1 table has
all-zero entries, then the image has not been written to yet. Reading
the L1 table for all-zeroes is only a single cluster read, which is MUCH
faster than crawling the entire image for extent status. And for
regular files, a single lseek(SEEK_DATA) is sufficient to see if the
entire image is currently sparse.

Note that I only proposed the NBD implementation - it still remains to
be coded into the qemu code for the client to make use of the bit
(fairly easy: if the bit is set, the client can make its own
.bdrv_has_zero_init() return true), as well as for the server to set the
bit (harder: the server has to check .bdrv_has_zero_init() of the
wrapped image, but also has to prove the image is still unwritten).
Maybe this means that qemu's block layer wants to add a new
.bdrv_has_been_written() [or whatever name] to better abstract the proof
across drivers. But those patches would be qemu 2.9 material, and do
not need to further cc the NBD list.

> This is useful for 'qemu-img convert', which creates an image and then
> writes the whole contents, but I'm not sure if this property is
> applicable for NBD, which I think doesn't even have a create operation.

Another option on the NBD server side is to create a server option -
when firing up a server to serve a particular file as an export, the
user can explicitly tell the server to advertise the bit because the
user has side knowledge that the file was just created (and then the
burden of misbehavior is on the user if they mistakenly request the
advertisement when it is not true).
Am 06.12.2016 um 16:21 hat Eric Blake geschrieben:
> On 12/06/2016 03:25 AM, Kevin Wolf wrote:
>> Am 06.12.2016 um 00:42 hat Eric Blake geschrieben:
>>> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
>>> team discovered that it is useful if a server can advertise
>>> whether an export is in a known-all-zeroes state at the time
>>> the client connects.
>>
>> Does a server usually have the information to set this flag, other than
>> querying the block status of all blocks at startup? If so, the client
>> could just query this by itself.
>
> Well, only if the client can query information at all (we don't have the
> documentation finished for extent queries, let alone a reference
> implementation).

Right, but I think we all agree that this is something that is necessary
and will come sooner or later.

>> The patch that was originally sent to qemu-devel just forwarded qemu's
>> .bdrv_has_zero_init() call to the server. However, what this function
>> returns is not a known-all-zeroes state on open, but just a
>> known-all-zeroes state immediately after bdrv_create(), i.e. creating a
>> new image. Then it becomes information that is easy to get and doesn't
>> involve querying all blocks (e.g. true for COW image formats, true for
>> raw on regular files, false for raw on block devices).
>
> Just because the NBD spec describes the bit does NOT require that
> servers HAVE to set the bit on all images that are all zeroes. It is
> perfectly compliant if the server never advertises the bit.

True, but if no server exists that would actually make use of the
feature, it's kind of useless to include it in the spec.

I think we should have concrete use cases in mind when extending the
spec, and explain them in the commit message. Just "maybe this could be
useful for someone sometime" isn't a good enough justification if you
ask me.

> That said, I think there are cases where qemu can easily advertise the
> bit.
>
> I _do_ agree that it is NOT as trivial as the qemu server just
> forwarding the value of .bdrv_has_zero_init() - the server HAS to prove
> that no data has been written to the image. But for a qcow2 image just
> created with qemu-img, it is a fairly easy proof: If the L1 table has
> all-zero entries, then the image has not been written to yet. Reading
> the L1 table for all-zeroes is only a single cluster read, which is MUCH
> faster than crawling the entire image for extent status. And for
> regular files, a single lseek(SEEK_DATA) is sufficient to see if the
> entire image is currently sparse.
>
> Note that I only proposed the NBD implementation - it still remains to
> be coded into the qemu code for the client to make use of the bit
> (fairly easy: if the bit is set, the client can make its own
> .bdrv_has_zero_init() return true), as well as for the server to set the
> bit (harder: the server has to check .bdrv_has_zero_init() of the
> wrapped image, but also has to prove the image is still unwritten).
> Maybe this means that qemu's block layer wants to add a new
> .bdrv_has_been_written() [or whatever name] to better abstract the proof
> across drivers. But those patches would be qemu 2.9 material, and do
> not need to further cc the NBD list.

qemu doesn't really know whether an image has been written to since it
has been created. The interesting case is probably where the image is
created externally with qemu-img before it's exported either with
qemu-nbd or the builtin server, and then we use it as a mirror target.

Even in the rare cases where both image creation and the NBD server are
in the same process, bdrv_create() doesn't work on a BlockDriverState,
but just on a filename. So even then you would have to do hacks like
remembering file names between create and the first open or something
like that.

>> This is useful for 'qemu-img convert', which creates an image and then
>> writes the whole contents, but I'm not sure if this property is
>> applicable for NBD, which I think doesn't even have a create operation.
>
> Another option on the NBD server side is to create a server option -
> when firing up a server to serve a particular file as an export, the
> user can explicitly tell the server to advertise the bit because the
> user has side knowledge that the file was just created (and then the
> burden of misbehavior is on the user if they mistakenly request the
> advertisement when it is not true).

Maybe that's the only practical approach.

Kevin
On 12/07/2016 04:44 AM, Kevin Wolf wrote:
>> Just because the NBD spec describes the bit does NOT require that
>> servers HAVE to set the bit on all images that are all zeroes. It is
>> perfectly compliant if the server never advertises the bit.
>
> True, but if no server exists that would actually make use of the
> feature, it's kind of useless to include it in the spec.

No server, or no client? I think qemu can be a client fairly easily: if
the server advertises the bit, then the client can set
.bdrv_has_zero_init() and avoid rewriting any sector of a file that is
already known to be zero.

> I think we should have concrete use cases in mind when extending the
> spec, and explain them in the commit message. Just "maybe this could be
> useful for someone sometime" isn't a good enough justification if you
> ask me.

But I think we DO have a concrete case already in mind, both of a client
being able to take advantage of the information if the bit is set, and
of servers that are able to set the bit because they can either
efficiently determine that the file is all-zeroes, or can give the user
a flag to advertise the bit.

> qemu doesn't really know whether an image has been written to since it
> has been created.

It's not so much whether the image has been written to, as much as
easily proving that the image reads as all zeroes.

> The interesting case is probably where the image is
> created externally with qemu-img before it's exported either with
> qemu-nbd or the builtin server, and then we use it as a mirror target.
>
> Even in the rare cases where both image creation and the NBD server are
> in the same process, bdrv_create() doesn't work on a BlockDriverState,
> but just on a filename. So even then you would have to do hacks like
> remembering file names between create and the first open or something
> like that.

Or add in a driver-specific callback that checks if a file is provably
all-zeroes; for the raw file driver, check if lseek(SEEK_DATA) finds
nothing, for the qcow2 driver, check for no backing files, and no L1
table entries.

>> Another option on the NBD server side is to create a server option -
>> when firing up a server to serve a particular file as an export, the
>> user can explicitly tell the server to advertise the bit because the
>> user has side knowledge that the file was just created (and then the
>> burden of misbehavior is on the user if they mistakenly request the
>> advertisement when it is not true).
>
> Maybe that's the only practical approach.

But it's still a viable approach, and therefore this bit definition in
the NBD protocol is worthwhile.
Am 07.12.2016 um 16:50 hat Eric Blake geschrieben:
> On 12/07/2016 04:44 AM, Kevin Wolf wrote:
>>> Just because the NBD spec describes the bit does NOT require that
>>> servers HAVE to set the bit on all images that are all zeroes. It is
>>> perfectly compliant if the server never advertises the bit.
>>
>> True, but if no server exists that would actually make use of the
>> feature, it's kind of useless to include it in the spec.
>
> No server, or no client?

Well, you need both to make use of the feature.

> I think qemu can be a client fairly easily: if the server advertises
> the bit, then the client can set .bdrv_has_zero_init() and avoid
> rewriting any sector of a file that is already known to be zero.

Yes, the client part makes sense to me and should be easy. Are you going
to write the patches yourself?

>> The interesting case is probably where the image is
>> created externally with qemu-img before it's exported either with
>> qemu-nbd or the builtin server, and then we use it as a mirror target.
>>
>> Even in the rare cases where both image creation and the NBD server are
>> in the same process, bdrv_create() doesn't work on a BlockDriverState,
>> but just on a filename. So even then you would have to do hacks like
>> remembering file names between create and the first open or something
>> like that.
>
> Or add in a driver-specific callback that checks if a file is provably
> all-zeroes; for the raw file driver, check if lseek(SEEK_DATA) finds
> nothing, for the qcow2 driver, check for no backing files, and no L1
> table entries.

In theory, there should be a common efficient abstraction for these two:

    bool bs_is_zeroed(BlockDriverState *bs)
    {
        int pnum;
        int64_t ret = bdrv_get_block_status(bs, 0, bs->total_sectors,
                                            &pnum, NULL);
        return ret == 0 && pnum == bs->total_sectors;
    }

For raw this does the lseek() you mentioned, and for qcow2 it will look
at the L1 table and one L2 table (the first one that exists).

This is a bit more expensive than just looking at the L1 table, but so
minimally that it's far from justifying a new command or flag in a
protocol.

Now, in practice, this doesn't work because bdrv_get_block_status() can
return early even if the contiguous area is longer than what it returns.
This is something that we should possibly fix sometime in qemu, but it's
also independent from NBD (we can loop in the NBD server to give the
desired semantics).

So we are already going to export a block status querying interface to
NBD. If we require there that the whole contiguous area of the same
status is described and the server can't just shorten it, then we get
the very same thing without a new flag.

>>> Another option on the NBD server side is to create a server option -
>>> when firing up a server to serve a particular file as an export, the
>>> user can explicitly tell the server to advertise the bit because the
>>> user has side knowledge that the file was just created (and then the
>>> burden of misbehavior is on the user if they mistakenly request the
>>> advertisement when it is not true).
>>
>> Maybe that's the only practical approach.
>
> But it's still a viable approach, and therefore this bit definition in
> the NBD protocol is worthwhile.

If it adds something that we can't easily get otherwise, and if someone
volunteers to write the patches, then yes. I'm not completely convinced
yet on that.

Kevin
Hi Eric,

I'm not opposed to this proposal, per se, but there seems to be some
disagreement (by Kevin, for instance) on whether this extension is at
all useful. If you can clarify that, I'll be happy to merge it.

On Mon, Dec 05, 2016 at 05:42:35PM -0600, Eric Blake wrote:
> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
> team discovered that it is useful if a server can advertise
> whether an export is in a known-all-zeroes state at the time
> the client connects.
>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  doc/proto.md | 5 +++++
>  1 file changed, 5 insertions(+)
>
> This replaces the following qemu patch attempt:
> https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00357.html
> which tried to add NBD_CMD_HAS_ZERO_INIT with poor semantics. The
> semantics in this proposal should be much better.
>
> Patch is to the merge of the master branch and the
> extension-write-zeroes branch. By the way, qemu 2.8 is due
> to be released "real soon now", and implements NBD_CMD_WRITE_ZEROES,
> so maybe it is time to consider promoting the extension-write-zeroes
> branch into master.
>
> diff --git a/doc/proto.md b/doc/proto.md
> index afe71fc..7e4ec7f 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -697,6 +697,11 @@ The field has the following format:
>      the export.
>  - bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`: defined by the experimental
>    `BLOCK_STATUS` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-blockstatus/doc/proto.md).
> +- bit 10, `NBD_FLAG_INIT_ZEROES`: Indicates that the server guarantees
> +  that at the time transmission phase begins, all offsets within the
> +  export read as zero bytes. Clients MAY use this information to
> +  avoid writing to sections of the export that should still read as
> +  zero after the client is done writing.
>
>  Clients SHOULD ignore unknown flags.
Am 12.12.2016 um 19:12 hat Wouter Verhelst geschrieben:
> I'm not opposed to this proposal, per se, but there seems to be some
> disagreement (by Kevin, for instance) on whether this extension is at
> all useful.

FWIW, I'm not opposed to adding the flag. I don't think it can hurt, but
I wanted to ask the question so that we don't end up adding a feature
that no one actually uses. Ultimately it's your call as a maintainer
anyway how conservative you want to be with spec additions.

Kevin
On Tue, Dec 13, 2016 at 08:38:12AM +0100, Kevin Wolf wrote:
> Am 12.12.2016 um 19:12 hat Wouter Verhelst geschrieben:
>> I'm not opposed to this proposal, per se, but there seems to be some
>> disagreement (by Kevin, for instance) on whether this extension is at
>> all useful.
>
> FWIW, I'm not opposed to adding the flag. I don't think it can hurt, but
> I wanted to ask the question so that we don't end up adding a feature
> that no one actually uses. Ultimately it's your call as a maintainer
> anyway how conservative you want to be with spec additions.

I'm not opposed either, but I do agree with you that we shouldn't add
such a feature if it doesn't end up getting used. Especially so if it
burns a flag in the (16-bit) "transmission flags" field, where space is
at a premium.
> On 13 Dec 2016, at 12:18, Wouter Verhelst <w@uter.be> wrote:
>
> I'm not opposed either, but I do agree with you that we shouldn't add
> such a feature if it doesn't end up getting used. Especially so if it
> burns a flag in the (16-bit) "transmission flags" field, where space is
> at a premium.

I did suggest a few non-Qemu uses (see below). I think it might be an
idea if the reference implementation supported it before merging (which
per below should be trivial).
On 12/13/2016 06:18 AM, Wouter Verhelst wrote:
> On Tue, Dec 13, 2016 at 08:38:12AM +0100, Kevin Wolf wrote:
>> Am 12.12.2016 um 19:12 hat Wouter Verhelst geschrieben:
>>> I'm not opposed to this proposal, per se, but there seems to be some
>>> disagreement (by Kevin, for instance) on whether this extension is at
>>> all useful.
>>
>> FWIW, I'm not opposed to adding the flag. I don't think it can hurt, but
>> I wanted to ask the question so that we don't end up adding a feature
>> that no one actually uses. Ultimately it's your call as a maintainer
>> anyway how conservative you want to be with spec additions.
>
> I'm not opposed either, but I do agree with you that we shouldn't add
> such a feature if it doesn't end up getting used. Especially so if it
> burns a flag in the (16-bit) "transmission flags" field, where space is
> at a premium.

No, it is NOT a "transmission flag", as it is a per-export property
(where we currently have 64 bits).
Hi Eric,

On Tue, Dec 13, 2016 at 04:36:08PM -0600, Eric Blake wrote:
> On 12/13/2016 06:18 AM, Wouter Verhelst wrote:
>> On Tue, Dec 13, 2016 at 08:38:12AM +0100, Kevin Wolf wrote:
>>> Am 12.12.2016 um 19:12 hat Wouter Verhelst geschrieben:
>>>> I'm not opposed to this proposal, per se, but there seems to be some
>>>> disagreement (by Kevin, for instance) on whether this extension is at
>>>> all useful.
>>>
>>> FWIW, I'm not opposed to adding the flag. I don't think it can hurt, but
>>> I wanted to ask the question so that we don't end up adding a feature
>>> that no one actually uses. Ultimately it's your call as a maintainer
>>> anyway how conservative you want to be with spec additions.
>>
>> I'm not opposed either, but I do agree with you that we shouldn't add
>> such a feature if it doesn't end up getting used. Especially so if it
>> burns a flag in the (16-bit) "transmission flags" field, where space is
>> at a premium.
>
> No, it is NOT a "transmission flag", as it is a per-export property
> (where we currently have 64 bits).

That may be what you meant, but the patch you sent uses a flag in the
transmission flags field.

If you meant to use something else, you'll have to clarify what your
patch should have been like ;-)
On 12/14/2016 02:22 AM, Wouter Verhelst wrote:
> Hi Eric,
>
> On Tue, Dec 13, 2016 at 04:36:08PM -0600, Eric Blake wrote:
>> On 12/13/2016 06:18 AM, Wouter Verhelst wrote:
>>> I'm not opposed either, but I do agree with you that we shouldn't add
>>> such a feature if it doesn't end up getting used. Especially so if it
>>> burns a flag in the (16-bit) "transmission flags" field, where space is
>>> at a premium.
>>
>> No, it is NOT a "transmission flag", as it is a per-export property
>> (where we currently have 64 bits).
>
> That may be what you meant, but the patch you sent uses a flag in the
> transmission flags field.
>
> If you meant to use something else, you'll have to clarify what your
> patch should have been like ;-)

/me goes back and re-reads spec - I shouldn't reply to mails purely off
of memory...

Okay, I'm crossing terms here. "Transmission flags" ARE the per-export
flags - and are sent in response to NBD_OPT_EXPORT_NAME or NBD_OPT_INFO
or NBD_OPT_GO. And you are right that we only have 16 bits in the
current spec, but that they can differ between exports (case in point:
NBD_FLAG_READ_ONLY). So my proposal of bit 10 as NBD_FLAG_INIT_ZEROES is
potentially in the right place - it is a per-export property (a server
may support multiple named exports for the client to choose from, of
which only a subset are known to be all zero content at the time of
export).

But your argument that 16 bits is at a premium may be worth considering
an alternative representation for the same information.

Possibility 1: when we run out of our current 16 transmission flags,
we'll need a way to extend the protocol to support more than that. To
do that, we'd probably want a new Handshake flag
NBD_FLAG_LARGE_TRANSMISSION_FLAGS sent by the server, and which must be
answered by the client sending a new Client flag
NBD_FLAG_C_LARGE_TRANSMISSION_FLAGS, so that both sides know that all
subsequent places in the protocol that send transmission flags will now
send 64 bits instead of 16 bits (old-style negotiation cannot benefit
from this, and is permanently stuck at 16 bits, but we discourage
old-style negotiation). We'd still want to prioritize which bits land
in the first 16 positions, for maximum compatibility with older servers
or clients (where the higher bit positions are used for data that is
more suited for optimizations, rather than controlling correct usage) -
thus NBD_FLAG_INIT_ZEROES would serve as an example to live in the
extended positions.

Possibility 2: Leverage the extension-info branch: Add a new
NBD_INFO_INIT_ZERO information datum. NBD_BLOCK_INFO already exists to
let server and client exchange as much per-export information as they
want during handshake without burning any new transmission flags, and
with a specification that gracefully ignores unknown requests from the
client and unknown advertisements from the server. There's the drawback
that the server can't advertise the known-zero-init optimization to
clients that don't use NBD_OPT_GO, but that's not too bad (especially if
qemu is the first client with code to actually make use of the
optimization, since I am already trying to get qemu to be one of the
first adopters of NBD_OPT_GO).

We'll probably have to eventually use something along the lines of
possibility 1 for more transmission flags for some other reason down the
road (since we have lately been adding one flag per new command/command
group as we add extensions), but I agree that it is a bit heavy for now.
So it sounds like I should rework the proposal along the lines of
possibility 2, which also implies that we should get extension-info
ready for promotion to current. And that means that my current proposal
to the extension-write-zeroes branch is probably not ideal, but it
reopens the question of whether extension-write-zeroes as it currently
stands is ready for promotion to stable now that qemu 2.8 implements it.
> ... But your argument that 16 bits is at a premium may be worth > considering an alternative representation for the same information. > > Possibility 1: when we run out of our current 16 transmission flags, > we'll need a way to extend the protocol to support more than that. To > do that, we'd probably want a new Handshake flag > NBD_FLAG_LARGE_TRANSMISSION_FLAGS sent by the server, and which must be > answered by the client sending a new Client flag > NBD_FLAG_C_LARGE_TRANSMISSION_FLAGS, so that both sides know that all > subsequent places in the protocol that send transmission flags will now > send 64 bits instead of 16 bits (old-style negotiation cannot benefit > from this, and is permanently stuck at 16 bits, but we discourage > old-style negotiation). We'd still want to prioritize which bits land > in the first 16 positions, for maximum compatibility with older servers > or clients (where the higher bit positions are used for data that is > more suited for optimizations, rather than controlling correct usage) - > thus NBD_FLAG_INIT_ZEROES would serve as an example to live in the > extended positions. Well, to me, this just reads like the transmission flags aren't limited to 16 bits as you've demonstrated a workaround. Any client/server needing to interpret more bits than the 16 just needs to also understand extended transmission bit flags. I don't that that's an unreasonable requirement, so I don't actually buy the prioritisation argument. So this seems to me like a good argument we should proceed how you originally suggested and Wouter should be less worried about your flag-burning activities :-) > Possibility 2: Leverage the extension-info branch: Add a new > NBD_INFO_INIT_ZERO information datum. NBD_BLOCK_INFO already exists to I think you mean NBD_INFO_BLOCK_SIZE? 
> let server and client exchange as much per-export information as they
> want during handshake without burning any new transmission flags, and
> with a specification that gracefully ignores unknown requests from the
> client and unknown advertisements from the server. There's the drawback
> that the server can't advertise the known-zero-init optimization to
> clients that don't use NBD_OPT_GO, but that's not too bad (especially if
> qemu is the first client with code to actually make use of the
> optimization, since I am already trying to get qemu to be one of the
> first adopters of NBD_OPT_GO).

I think you are suggesting adding NBD_INFO_SOMETHINGELSE, which could
carry extended transmission flags. Fair enough.

Possibility #3: we modify NBD_INFO_EXPORT (which is still experimental)
*now* to carry an extended, 64-bit transmission flags field. This is
where transmission flags SHOULD go, and just as we'll run out of them in
NBD_OPT_EXPORT_NAME we'll also run out of the same 16 flags in
NBD_OPT_GO. So it seems a good idea to free us from that limitation now.
And all clients (one hopes) will use NBD_OPT_GO eventually.

This has the minor disadvantage of breaking clients that rely on the
current state of the experimental extension, but that's why it's
experimental. Newer clients can talk to older servers if they are so
inclined by checking the length field on the option (which will extend
from 12 to 16).

> We'll probably have to eventually use something along the lines of
> possibility 1 for more transmission flags for some other reason down the
> road (since we have lately been adding one flag per new command/command
> group as we add extensions), but I agree that it is a bit heavy for now.
> So it sounds like I should rework the proposal along the lines of
> possibility 2, which also implies that we should get extension-info
> ready for promotion to current.
> And that means that my current proposal
> to the extension-write-zeroes branch is probably not ideal, but it
> reopens the question of whether extension-write-zeroes as it currently
> stands is ready for promotion to stable now that qemu 2.8 implements it.

My view is:

#0 (your original proposal) is actually fine.

#1 seems too heavy.

#2 also seems pretty heavyweight - adding a whole new info command for
one bit.

#3 is pretty simple, but carries the disadvantage that you won't be able
to provide a reference implementation without also putting NBD_OPT_GO
support into the reference implementation. Oh hang on, perhaps that's an
advantage :-)

So I'd either go with #0 or #3.

-- 
Alex Bligh
Hi Eric,

On Wed, Dec 14, 2016 at 10:37:43AM -0600, Eric Blake wrote:
> On 12/14/2016 02:22 AM, Wouter Verhelst wrote:
> > Hi Eric,
> >
> > On Tue, Dec 13, 2016 at 04:36:08PM -0600, Eric Blake wrote:
> >> On 12/13/2016 06:18 AM, Wouter Verhelst wrote:
> >>> On Tue, Dec 13, 2016 at 08:38:12AM +0100, Kevin Wolf wrote:
> >>>> Am 12.12.2016 um 19:12 hat Wouter Verhelst geschrieben:
> >>>>> I'm not opposed to this proposal, per se, but there seems to be some
> >>>>> disagreement (by Kevin, for instance) on whether this extension is at
> >>>>> all useful.
> >>>>
> >>>> FWIW, I'm not opposed to adding the flag. I don't think it can hurt, but
> >>>> I wanted to ask the question so that we don't end up adding a feature
> >>>> that noone actually uses. Ultimately it's your call as a maintainer
> >>>> anyway how conservative you want to be with spec additions.
> >>>
> >>> I'm not opposed either, but I do agree with you that we shouldn't add
> >>> such a feature if it doesn't end up getting used. Especially so if it
> >>> burns a flag in the (16-bit) "transmission flags" field, where space is
> >>> at a premium.
> >>
> >> No, it is NOT a "transmission flag", as it is a per-export property
> >> (where we currently have 64 bits).
> >
> > That may be what you meant, but the patch you sent uses a flag in the
> > transmission flags field.
> >
> > If you meant to use something else, you'll have to clarify what your
> > patch should have been like ;-)
>
> /me goes back and re-reads spec - I shouldn't reply to mails purely off
> of memory...

Well, I tend to do that too, so no worries :-)

> Okay, I'm crossing terms here. "Transmission flags" ARE the per-export
> flags - and are sent in response to NBD_OPT_EXPORT_NAME or NBD_OPT_INFO
> or NBD_OPT_GO. And you are right that we only have 16 bits in the
> current spec, but that they can differ between exports (case in point:
> NBD_FLAG_READ_ONLY).
> So my proposal of bit 10 as NBD_FLAG_INIT_ZEROES
> is potentially in the right place - it is a per-export property (a
> server may support multiple named exports for the client to choose from,
> of which only a subset are known to be all zero content at the time of
> export). But your argument that 16 bits is at a premium may be worth
> considering an alternative representation for the same information.

Right. The reason we "need" transmission flags (and I grant you that's a
bit weak of a reason) is that they get passed, unmodified, to the kernel
in case of the Linux implementation, by way of ioctl(NBD_SET_FLAGS). The
advantage to that is that we can easily teach the kernel about new
features by simply using transmission flags; if the new feature is a
boolean (i.e., either it can be used or it cannot, but the server won't
care if the client doesn't use it), then using a flag is the easiest way
to accomplish that. If it requires more (e.g., in the case of the block
sizes thing), then a flag doesn't buy us much.

When I wrote up the newstyle negotiation, I split the flags field (which
already existed, and which was already passed into the kernel in that
way) into two. Admittedly that was a mistake, but it is what it is, and
we can't extend the size of the transmission flags field except by
backwards-incompatible changes.

At any rate though, if the flags ever get used up, I'm sure we could
reserve flag 15 (that is, 1 << 15) as a marker to say that there's more,
and then the kernel could add another ioctl() to add a second flags
field, which the user-space client could ask from the server by way of
another NBD_OPT_* message (and it could discover that it exists by way
of NBD_OPT_INFO).

All that to say that I think a flag is really only useful if it marks
support in the server for some feature which would be backwards
incompatible if the client were to just use it without further
consideration.
If that isn't the case, then some option haggling is probably the better
choice.

In case of "this device is initialized with zeroes", I'm frankly not
sure what good that information would do to the kernel; I don't think
it's anything which could allow it to optimize (or not) some things.
Perhaps at the start, yes, but as soon as it's written *something*, it'd
be useless information (unless it were to keep a bitmap of some sort,
but by that time you're optimizing for a corner case, which seems
silly). This, too, is another reason why I don't think your proposal as
stated is very useful.

> Possibility 1: when we run out of our current 16 transmission flags,
> we'll need a way to extend the protocol to support more than that. To
> do that, we'd probably want a new Handshake flag
> NBD_FLAG_LARGE_TRANSMISSION_FLAGS sent by the server, and which must be
> answered by the client sending a new Client flag
> NBD_FLAG_C_LARGE_TRANSMISSION_FLAGS, so that both sides know that all
> subsequent places in the protocol that send transmission flags will now
> send 64 bits instead of 16 bits (old-style negotiation cannot benefit
> from this, and is permanently stuck at 16 bits, but we discourage
> old-style negotiation).

That could work too, yes.

[...]

> Possibility 2: Leverage the extension-info branch: Add a new
> NBD_INFO_INIT_ZERO information datum.

I think this is preferable, really.

[...]

> We'll probably have to eventually use something along the lines of
> possibility 1 for more transmission flags for some other reason down the
> road (since we have lately been adding one flag per new command/command
> group as we add extensions), but I agree that it is a bit heavy for now.
> So it sounds like I should rework the proposal along the lines of
> possibility 2, which also implies that we should get extension-info
> ready for promotion to current.

Indeed.
> And that means that my current proposal to the extension-write-zeroes
> branch is probably not ideal, but it reopens the question of whether
> extension-write-zeroes as it currently stands is ready for promotion
> to stable now that qemu 2.8 implements it.

Right.
diff --git a/doc/proto.md b/doc/proto.md
index afe71fc..7e4ec7f 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -697,6 +697,11 @@ The field has the following format:
   the export.
 - bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`: defined by the experimental
   `BLOCK_STATUS` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-blockstatus/doc/proto.md).
+- bit 10, `NBD_FLAG_INIT_ZEROES`: Indicates that the server guarantees
+  that at the time transmission phase begins, all offsets within the
+  export read as zero bytes. Clients MAY use this information to
+  avoid writing to sections of the export that should still read as
+  zero after the client is done writing.
 
 Clients SHOULD ignore unknown flags.
While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
team discovered that it is useful if a server can advertise
whether an export is in a known-all-zeroes state at the time
the client connects.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 doc/proto.md | 5 +++++
 1 file changed, 5 insertions(+)

This replaces the following qemu patch attempt:
https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00357.html
which tried to add NBD_CMD_HAS_ZERO_INIT with poor semantics. The
semantics in this proposal should be much better.

Patch is to the merge of the master branch and the
extension-write-zeroes branch. By the way, qemu 2.8 is due
to be released "real soon now", and implements NBD_CMD_WRITE_ZEROES,
so maybe it is time to consider promoting the extension-write-zeroes
branch into master.