Message ID | 20161205234235.5728-1-eblake@redhat.com (mailing list archive)
---|---
State | New, archived
> On 5 Dec 2016, at 23:42, Eric Blake <eblake@redhat.com> wrote:
>
> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
> team discovered that it is useful if a server can advertise
> whether an export is in a known-all-zeroes state at the time
> the client connects.

I think this is good to go, and ...

> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  doc/proto.md | 5 +++++
>  1 file changed, 5 insertions(+)
>
> This replaces the following qemu patch attempt:
> https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00357.html
> which tried to add NBD_CMD_HAS_ZERO_INIT with poor semantics. The
> semantics in this proposal should be much better.
>
> Patch is to the merge of the master branch and the
> extension-write-zeroes branch. By the way, qemu 2.8 is due
> to be released "real soon now", and implements NBD_CMD_WRITE_ZEROES,
> so maybe it is time to consider promoting the extension-write-zeroes
> branch into master.

I would support this.

In fact the patch is sufficiently simple I think I'd merge this
into extension-write-zeroes then merge that into master.

Wouter?

Alex

> diff --git a/doc/proto.md b/doc/proto.md
> index afe71fc..7e4ec7f 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -697,6 +697,11 @@ The field has the following format:
>      the export.
>  - bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`: defined by the experimental
>    `BLOCK_STATUS` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-blockstatus/doc/proto.md).
> +- bit 10, `NBD_FLAG_INIT_ZEROES`: Indicates that the server guarantees
> +  that at the time transmission phase begins, all offsets within the
> +  export read as zero bytes. Clients MAY use this information to
> +  avoid writing to sections of the export that should still read as
> +  zero after the client is done writing.
>
>  Clients SHOULD ignore unknown flags.
>
> --
> 2.9.3
>
> ------------------------------------------------------------------------------
> Developer Access Program for Intel Xeon Phi Processors
> Access to Intel Xeon Phi processor-based developer platforms.
> With one year of Intel Parallel Studio XE.
> Training and support from Colfax.
> Order your platform today. http://sdm.link/xeonphi
> _______________________________________________
> Nbd-general mailing list
> Nbd-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nbd-general
Am 06.12.2016 um 00:42 hat Eric Blake geschrieben:
> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
> team discovered that it is useful if a server can advertise
> whether an export is in a known-all-zeroes state at the time
> the client connects.

Does a server usually have the information to set this flag, other than
querying the block status of all blocks at startup? If so, the client
could just query this by itself.

The patch that was originally sent to qemu-devel just forwarded qemu's
.bdrv_has_zero_init() call to the server. However, what this function
returns is not a known-all-zeroes state on open, but just a
known-all-zeroes state immediately after bdrv_create(), i.e. creating a
new image. Then it becomes information that is easy to get and doesn't
involve querying all blocks (e.g. true for COW image formats, true for
raw on regular files, false for raw on block devices).

This is useful for 'qemu-img convert', which creates an image and then
writes the whole contents, but I'm not sure if this property is
applicable for NBD, which I think doesn't even have a create operation.

Kevin
On Mon, Dec 05, 2016 at 05:42:35PM -0600, Eric Blake wrote:
> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
> team discovered that it is useful if a server can advertise
> whether an export is in a known-all-zeroes state at the time
> the client connects.
>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  doc/proto.md | 5 +++++
>  1 file changed, 5 insertions(+)
>
> This replaces the following qemu patch attempt:
> https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00357.html
> which tried to add NBD_CMD_HAS_ZERO_INIT with poor semantics. The
> semantics in this proposal should be much better.
>
> Patch is to the merge of the master branch and the
> extension-write-zeroes branch. By the way, qemu 2.8 is due
> to be released "real soon now", and implements NBD_CMD_WRITE_ZEROES,
> so maybe it is time to consider promoting the extension-write-zeroes
> branch into master.

Useful, thanks!

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> On 6 Dec 2016, at 09:25, Kevin Wolf <kwolf@redhat.com> wrote:
>
> Am 06.12.2016 um 00:42 hat Eric Blake geschrieben:
>> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
>> team discovered that it is useful if a server can advertise
>> whether an export is in a known-all-zeroes state at the time
>> the client connects.
>
> Does a server usually have the information to set this flag, other than
> querying the block status of all blocks at startup?

The server may have other ways of knowing this, for instance that it has
just created the file (*), or that it stat'd the file before opening it
(not unlikely) and noticed it had 0 allocated size. The latter I suspect
would be trivial to implement in nbd-server.

(*) = e.g. I had one application where nbd used the export path to
signify it wanted to open a temporary file, the path consisting of a
UUID and an encoded length. If the file was not present already it
created it with ftruncate(). That could trivially have used this.

> If so, the client could just query this by itself.

Well, there's no currently mainlined extension to do that, but yes it
could. On the other hand I see no issue passing complete zero status
back to the client if it's so obvious from a stat().
> On 6 Dec 2016, at 08:46, Alex Bligh <alex@alex.org.uk> wrote:
>
> I would support this.
>
> In fact the patch is sufficiently simple I think I'd merge this
> into extension-write-zeroes then merge that into master.

Hence:

Reviewed-By: Alex Bligh <alex@alex.org.uk>
On 12/06/2016 03:25 AM, Kevin Wolf wrote:
> Am 06.12.2016 um 00:42 hat Eric Blake geschrieben:
>> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
>> team discovered that it is useful if a server can advertise
>> whether an export is in a known-all-zeroes state at the time
>> the client connects.
>
> Does a server usually have the information to set this flag, other than
> querying the block status of all blocks at startup? If so, the client
> could just query this by itself.

Well, only if the client can query information at all (we don't have the
documentation finished for extent queries, let alone a reference
implementation).

> The patch that was originally sent to qemu-devel just forwarded qemu's
> .bdrv_has_zero_init() call to the server. However, what this function
> returns is not a known-all-zeroes state on open, but just a
> known-all-zeroes state immediately after bdrv_create(), i.e. creating a
> new image. Then it becomes information that is easy to get and doesn't
> involve querying all blocks (e.g. true for COW image formats, true for
> raw on regular files, false for raw on block devices).

Just because the NBD spec describes the bit does NOT require that
servers HAVE to set the bit on all images that are all zeroes. It is
perfectly compliant if the server never advertises the bit.

That said, I think there are cases where qemu can easily advertise the
bit.

I _do_ agree that it is NOT as trivial as the qemu server just
forwarding the value of .bdrv_has_zero_init() - the server HAS to prove
that no data has been written to the image. But for a qcow2 image just
created with qemu-img, it is a fairly easy proof: if the L1 table has
all-zero entries, then the image has not been written to yet. Reading
the L1 table for all-zeroes is only a single cluster read, which is MUCH
faster than crawling the entire image for extent status. And for
regular files, a single lseek(SEEK_DATA) is sufficient to see if the
entire image is currently sparse.

Note that I only proposed the NBD implementation - it still remains to
be coded into the qemu code for the client to make use of the bit
(fairly easy: if the bit is set, the client can make its own
.bdrv_has_zero_init() return true), as well as for the server to set the
bit (harder: the server has to check .bdrv_has_zero_init() of the
wrapped image, but also has to prove the image is still unwritten).
Maybe this means that qemu's block layer wants to add a new
.bdrv_has_been_written() [or whatever name] to better abstract the proof
across drivers. But those patches would be qemu 2.9 material, and do
not need to further cc the NBD list.

> This is useful for 'qemu-img convert', which creates an image and then
> writes the whole contents, but I'm not sure if this property is
> applicable for NBD, which I think doesn't even have a create operation.

Another option on the NBD server side is to create a server option -
when firing up a server to serve a particular file as an export, the
user can explicitly tell the server to advertise the bit because the
user has side knowledge that the file was just created (and then the
burden of misbehavior is on the user if they mistakenly request the
advertisement when it is not true).
Am 06.12.2016 um 16:21 hat Eric Blake geschrieben:
> On 12/06/2016 03:25 AM, Kevin Wolf wrote:
>> Am 06.12.2016 um 00:42 hat Eric Blake geschrieben:
>>> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
>>> team discovered that it is useful if a server can advertise
>>> whether an export is in a known-all-zeroes state at the time
>>> the client connects.
>>
>> Does a server usually have the information to set this flag, other than
>> querying the block status of all blocks at startup? If so, the client
>> could just query this by itself.
>
> Well, only if the client can query information at all (we don't have the
> documentation finished for extent queries, let alone a reference
> implementation).

Right, but I think we all agree that this is something that is necessary
and will come sooner or later.

>> The patch that was originally sent to qemu-devel just forwarded qemu's
>> .bdrv_has_zero_init() call to the server. However, what this function
>> returns is not a known-all-zeroes state on open, but just a
>> known-all-zeroes state immediately after bdrv_create(), i.e. creating a
>> new image. Then it becomes information that is easy to get and doesn't
>> involve querying all blocks (e.g. true for COW image formats, true for
>> raw on regular files, false for raw on block devices).
>
> Just because the NBD spec describes the bit does NOT require that
> servers HAVE to set the bit on all images that are all zeroes. It is
> perfectly compliant if the server never advertises the bit.

True, but if no server exists that would actually make use of the
feature, it's kind of useless to include it in the spec.

I think we should have concrete use cases in mind when extending the
spec, and explain them in the commit message. Just "maybe this could be
useful for someone sometime" isn't a good enough justification if you
ask me.

> That said, I think there are cases where qemu can easily advertise the
> bit.
>
> I _do_ agree that it is NOT as trivial as the qemu server just
> forwarding the value of .bdrv_has_zero_init() - the server HAS to prove
> that no data has been written to the image. But for a qcow2 image just
> created with qemu-img, it is a fairly easy proof: If the L1 table has
> all-zero entries, then the image has not been written to yet. Reading
> the L1 table for all-zeroes is only a single cluster read, which is MUCH
> faster than crawling the entire image for extent status. And for
> regular files, a single lseek(SEEK_DATA) is sufficient to see if the
> entire image is currently sparse.
>
> Note that I only proposed the NBD implementation - it still remains to
> be coded into the qemu code for the client to make use of the bit
> (fairly easy: if the bit is set, the client can make its own
> .bdrv_has_zero_init() return true), as well as for the server to set the
> bit (harder: the server has to check .bdrv_has_zero_init() of the
> wrapped image, but also has to prove the image is still unwritten).
> Maybe this means that qemu's block layer wants to add a new
> .bdrv_has_been_written() [or whatever name] to better abstract the proof
> across drivers. But those patches would be qemu 2.9 material, and do
> not need to further cc the NBD list.

qemu doesn't really know whether an image has been written to since it
has been created. The interesting case is probably where the image is
created externally with qemu-img before it's exported either with
qemu-nbd or the builtin server, and then we use it as a mirror target.

Even in the rare cases where both image creation and the NBD server are
in the same process, bdrv_create() doesn't work on a BlockDriverState,
but just on a filename. So even then you would have to do hacks like
remembering file names between create and the first open or something
like that.

>> This is useful for 'qemu-img convert', which creates an image and then
>> writes the whole contents, but I'm not sure if this property is
>> applicable for NBD, which I think doesn't even have a create operation.
>
> Another option on the NBD server side is to create a server option -
> when firing up a server to serve a particular file as an export, the
> user can explicitly tell the server to advertise the bit because the
> user has side knowledge that the file was just created (and then the
> burden of misbehavior is on the user if they mistakenly request the
> advertisement when it is not true).

Maybe that's the only practical approach.

Kevin
On 12/07/2016 04:44 AM, Kevin Wolf wrote:
>> Just because the NBD spec describes the bit does NOT require that
>> servers HAVE to set the bit on all images that are all zeroes. It is
>> perfectly compliant if the server never advertises the bit.
>
> True, but if no server exists that would actually make use of the
> feature, it's kind of useless to include it in the spec.

No server, or no client? I think qemu can be a client fairly easily: if
the server advertises the bit, then the client can set
.bdrv_has_zero_init() and avoid rewriting any sector of a file that is
already known to be zero.

> I think we should have concrete use cases in mind when extending the
> spec, and explain them in the commit message. Just "maybe this could be
> useful for someone sometime" isn't a good enough justification if you
> ask me.

But I think we DO have a concrete case already in mind, both of a client
being able to take advantage of the information if the bit is set, and
of servers that are able to set the bit because they can either
efficiently determine that the file is all-zeroes, or can give the user
a flag to advertise the bit.

> qemu doesn't really know whether an image has been written to since it
> has been created.

It's not so much whether the image has been written to, as much as
easily proving that the image reads as all zeroes.

> The interesting case is probably where the image is
> created externally with qemu-img before it's exported either with
> qemu-nbd or the builtin server, and then we use it as a mirror target.
>
> Even in the rare cases where both image creation and the NBD server are
> in the same process, bdrv_create() doesn't work on a BlockDriverState,
> but just on a filename. So even then you would have to do hacks like
> remembering file names between create and the first open or something
> like that.

Or add in a driver-specific callback that checks if a file is provably
all-zeroes; for the raw file driver, check if lseek(SEEK_DATA) finds
nothing, for the qcow2 driver, check for no backing files, and no L1
table entries.

>> Another option on the NBD server side is to create a server option -
>> when firing up a server to serve a particular file as an export, the
>> user can explicitly tell the server to advertise the bit because the
>> user has side knowledge that the file was just created (and then the
>> burden of misbehavior is on the user if they mistakenly request the
>> advertisement when it is not true).
>
> Maybe that's the only practical approach.

But it's still a viable approach, and therefore this bit definition in
the NBD protocol is worthwhile.
Am 07.12.2016 um 16:50 hat Eric Blake geschrieben:
> On 12/07/2016 04:44 AM, Kevin Wolf wrote:
>>> Just because the NBD spec describes the bit does NOT require that
>>> servers HAVE to set the bit on all images that are all zeroes. It is
>>> perfectly compliant if the server never advertises the bit.
>>
>> True, but if no server exists that would actually make use of the
>> feature, it's kind of useless to include it in the spec.
>
> No server, or no client?

Well, you need both to make use of the feature.

> I think qemu can be a client fairly easily: if the server advertises
> the bit, then the client can set .bdrv_has_zero_init() and avoid
> rewriting any sector of a file that is already known to be zero.

Yes, the client part makes sense to me and should be easy. Are you going
to write the patches yourself?

>> The interesting case is probably where the image is
>> created externally with qemu-img before it's exported either with
>> qemu-nbd or the builtin server, and then we use it as a mirror target.
>>
>> Even in the rare cases where both image creation and the NBD server are
>> in the same process, bdrv_create() doesn't work on a BlockDriverState,
>> but just on a filename. So even then you would have to do hacks like
>> remembering file names between create and the first open or something
>> like that.
>
> Or add in a driver-specific callback that checks if a file is provably
> all-zeroes; for the raw file driver, check if lseek(SEEK_DATA) finds
> nothing, for the qcow2 driver, check for no backing files, and no L1
> table entries.

In theory, there should be a common efficient abstraction for these two:

    bool bs_is_zeroed(BlockDriverState *bs)
    {
        int pnum;
        int64_t ret = bdrv_get_block_status(bs, 0, bs->total_sectors,
                                            &pnum, NULL);
        return ret == 0 && pnum == bs->total_sectors;
    }

For raw this does the lseek() you mentioned, and for qcow2 it will look
at the L1 table and one L2 table (the first one that exists).

This is a bit more expensive than just looking at the L1 table, but so
minimally that it's far from justifying a new command or flag in a
protocol.

Now, in practice, this doesn't work because bdrv_get_block_status() can
return early even if the contiguous area is longer than what it returns.
This is something that we should possibly fix sometime in qemu, but it's
also independent from NBD (we can loop in the NBD server to give the
desired semantics).

So we are already going to export a block status querying interface to
NBD. If we require there that the whole contiguous area of the same
status is described and the server can't just shorten it, then we get
the very same thing without a new flag.

>>> Another option on the NBD server side is to create a server option -
>>> when firing up a server to serve a particular file as an export, the
>>> user can explicitly tell the server to advertise the bit because the
>>> user has side knowledge that the file was just created (and then the
>>> burden of misbehavior is on the user if they mistakenly request the
>>> advertisement when it is not true).
>>
>> Maybe that's the only practical approach.
>
> But it's still a viable approach, and therefore this bit definition in
> the NBD protocol is worthwhile.

If it adds something that we can't easily get otherwise, and if someone
volunteers to write the patches, then yes. I'm not completely convinced
yet on that.

Kevin
Hi Eric,

I'm not opposed to this proposal, per se, but there seems to be some
disagreement (by Kevin, for instance) on whether this extension is at
all useful. If you can clarify that, I'll be happy to merge it.

On Mon, Dec 05, 2016 at 05:42:35PM -0600, Eric Blake wrote:
> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
> team discovered that it is useful if a server can advertise
> whether an export is in a known-all-zeroes state at the time
> the client connects.
>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  doc/proto.md | 5 +++++
>  1 file changed, 5 insertions(+)
>
> This replaces the following qemu patch attempt:
> https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00357.html
> which tried to add NBD_CMD_HAS_ZERO_INIT with poor semantics. The
> semantics in this proposal should be much better.
>
> Patch is to the merge of the master branch and the
> extension-write-zeroes branch. By the way, qemu 2.8 is due
> to be released "real soon now", and implements NBD_CMD_WRITE_ZEROES,
> so maybe it is time to consider promoting the extension-write-zeroes
> branch into master.
>
> diff --git a/doc/proto.md b/doc/proto.md
> index afe71fc..7e4ec7f 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -697,6 +697,11 @@ The field has the following format:
>      the export.
>  - bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`: defined by the experimental
>    `BLOCK_STATUS` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-blockstatus/doc/proto.md).
> +- bit 10, `NBD_FLAG_INIT_ZEROES`: Indicates that the server guarantees
> +  that at the time transmission phase begins, all offsets within the
> +  export read as zero bytes. Clients MAY use this information to
> +  avoid writing to sections of the export that should still read as
> +  zero after the client is done writing.
>
>  Clients SHOULD ignore unknown flags.
Am 12.12.2016 um 19:12 hat Wouter Verhelst geschrieben:
> I'm not opposed to this proposal, per se, but there seems to be some
> disagreement (by Kevin, for instance) on whether this extension is at
> all useful.

FWIW, I'm not opposed to adding the flag. I don't think it can hurt, but
I wanted to ask the question so that we don't end up adding a feature
that no one actually uses. Ultimately it's your call as a maintainer
anyway how conservative you want to be with spec additions.

Kevin
On Tue, Dec 13, 2016 at 08:38:12AM +0100, Kevin Wolf wrote:
> Am 12.12.2016 um 19:12 hat Wouter Verhelst geschrieben:
>> I'm not opposed to this proposal, per se, but there seems to be some
>> disagreement (by Kevin, for instance) on whether this extension is at
>> all useful.
>
> FWIW, I'm not opposed to adding the flag. I don't think it can hurt, but
> I wanted to ask the question so that we don't end up adding a feature
> that no one actually uses. Ultimately it's your call as a maintainer
> anyway how conservative you want to be with spec additions.

I'm not opposed either, but I do agree with you that we shouldn't add
such a feature if it doesn't end up getting used. Especially so if it
burns a flag in the (16-bit) "transmission flags" field, where space is
at a premium.
> On 13 Dec 2016, at 12:18, Wouter Verhelst <w@uter.be> wrote:
>
> I'm not opposed either, but I do agree with you that we shouldn't add
> such a feature if it doesn't end up getting used. Especially so if it
> burns a flag in the (16-bit) "transmission flags" field, where space is
> at a premium.

I did suggest a few non-Qemu uses (see below). I think it might be an
idea if the reference implementation supported it before merging (which
per below should be trivial).
On 12/13/2016 06:18 AM, Wouter Verhelst wrote:
> On Tue, Dec 13, 2016 at 08:38:12AM +0100, Kevin Wolf wrote:
>> Am 12.12.2016 um 19:12 hat Wouter Verhelst geschrieben:
>>> I'm not opposed to this proposal, per se, but there seems to be some
>>> disagreement (by Kevin, for instance) on whether this extension is at
>>> all useful.
>>
>> FWIW, I'm not opposed to adding the flag. I don't think it can hurt, but
>> I wanted to ask the question so that we don't end up adding a feature
>> that no one actually uses. Ultimately it's your call as a maintainer
>> anyway how conservative you want to be with spec additions.
>
> I'm not opposed either, but I do agree with you that we shouldn't add
> such a feature if it doesn't end up getting used. Especially so if it
> burns a flag in the (16-bit) "transmission flags" field, where space is
> at a premium.

No, it is NOT a "transmission flag", as it is a per-export property
(where we currently have 64 bits).
Hi Eric,

On Tue, Dec 13, 2016 at 04:36:08PM -0600, Eric Blake wrote:
> On 12/13/2016 06:18 AM, Wouter Verhelst wrote:
>> On Tue, Dec 13, 2016 at 08:38:12AM +0100, Kevin Wolf wrote:
>>> Am 12.12.2016 um 19:12 hat Wouter Verhelst geschrieben:
>>>> I'm not opposed to this proposal, per se, but there seems to be some
>>>> disagreement (by Kevin, for instance) on whether this extension is at
>>>> all useful.
>>>
>>> FWIW, I'm not opposed to adding the flag. I don't think it can hurt, but
>>> I wanted to ask the question so that we don't end up adding a feature
>>> that no one actually uses. Ultimately it's your call as a maintainer
>>> anyway how conservative you want to be with spec additions.
>>
>> I'm not opposed either, but I do agree with you that we shouldn't add
>> such a feature if it doesn't end up getting used. Especially so if it
>> burns a flag in the (16-bit) "transmission flags" field, where space is
>> at a premium.
>
> No, it is NOT a "transmission flag", as it is a per-export property
> (where we currently have 64 bits).

That may be what you meant, but the patch you sent uses a flag in the
transmission flags field.

If you meant to use something else, you'll have to clarify what your
patch should have been like ;-)
On 12/14/2016 02:22 AM, Wouter Verhelst wrote:
> Hi Eric,
>
> On Tue, Dec 13, 2016 at 04:36:08PM -0600, Eric Blake wrote:
>> On 12/13/2016 06:18 AM, Wouter Verhelst wrote:
>>> I'm not opposed either, but I do agree with you that we shouldn't add
>>> such a feature if it doesn't end up getting used. Especially so if it
>>> burns a flag in the (16-bit) "transmission flags" field, where space is
>>> at a premium.
>>
>> No, it is NOT a "transmission flag", as it is a per-export property
>> (where we currently have 64 bits).
>
> That may be what you meant, but the patch you sent uses a flag in the
> transmission flags field.
>
> If you meant to use something else, you'll have to clarify what your
> patch should have been like ;-)

/me goes back and re-reads spec - I shouldn't reply to mails purely off
of memory...

Okay, I'm crossing terms here. "Transmission flags" ARE the per-export
flags - and are sent in response to NBD_OPT_EXPORT_NAME or NBD_OPT_INFO
or NBD_OPT_GO. And you are right that we only have 16 bits in the
current spec, but that they can differ between exports (case in point:
NBD_FLAG_READ_ONLY). So my proposal of bit 10 as NBD_FLAG_INIT_ZEROES is
potentially in the right place - it is a per-export property (a server
may support multiple named exports for the client to choose from, of
which only a subset are known to be all zero content at the time of
export).

But your argument that 16 bits is at a premium may be worth considering
an alternative representation for the same information.

Possibility 1: when we run out of our current 16 transmission flags,
we'll need a way to extend the protocol to support more than that. To
do that, we'd probably want a new Handshake flag
NBD_FLAG_LARGE_TRANSMISSION_FLAGS sent by the server, and which must be
answered by the client sending a new Client flag
NBD_FLAG_C_LARGE_TRANSMISSION_FLAGS, so that both sides know that all
subsequent places in the protocol that send transmission flags will now
send 64 bits instead of 16 bits (old-style negotiation cannot benefit
from this, and is permanently stuck at 16 bits, but we discourage
old-style negotiation). We'd still want to prioritize which bits land
in the first 16 positions, for maximum compatibility with older servers
or clients (where the higher bit positions are used for data that is
more suited for optimizations, rather than controlling correct usage) -
thus NBD_FLAG_INIT_ZEROES would serve as an example to live in the
extended positions.

Possibility 2: Leverage the extension-info branch: Add a new
NBD_INFO_INIT_ZERO information datum. NBD_BLOCK_INFO already exists to
let server and client exchange as much per-export information as they
want during handshake without burning any new transmission flags, and
with a specification that gracefully ignores unknown requests from the
client and unknown advertisements from the server. There's the drawback
that the server can't advertise the known-zero-init optimization to
clients that don't use NBD_OPT_GO, but that's not too bad (especially if
qemu is the first client with code to actually make use of the
optimization, since I am already trying to get qemu to be one of the
first adopters of NBD_OPT_GO).

We'll probably have to eventually use something along the lines of
possibility 1 for more transmission flags for some other reason down the
road (since we have lately been adding one flag per new command/command
group as we add extensions), but I agree that it is a bit heavy for now.
So it sounds like I should rework the proposal along the lines of
possibility 2, which also implies that we should get extension-info
ready for promotion to current. And that means that my current proposal
to the extension-write-zeroes branch is probably not ideal, but it
reopens the question of whether extension-write-zeroes as it currently
stands is ready for promotion to stable now that qemu 2.8 implements it.
> ... But your argument that 16 bits is at a premium may be worth > considering an alternative representation for the same information. > > Possibility 1: when we run out of our current 16 transmission flags, > we'll need a way to extend the protocol to support more than that. To > do that, we'd probably want a new Handshake flag > NBD_FLAG_LARGE_TRANSMISSION_FLAGS sent by the server, and which must be > answered by the client sending a new Client flag > NBD_FLAG_C_LARGE_TRANSMISSION_FLAGS, so that both sides know that all > subsequent places in the protocol that send transmission flags will now > send 64 bits instead of 16 bits (old-style negotiation cannot benefit > from this, and is permanently stuck at 16 bits, but we discourage > old-style negotiation). We'd still want to prioritize which bits land > in the first 16 positions, for maximum compatibility with older servers > or clients (where the higher bit positions are used for data that is > more suited for optimizations, rather than controlling correct usage) - > thus NBD_FLAG_INIT_ZEROES would serve as an example to live in the > extended positions. Well, to me, this just reads like the transmission flags aren't limited to 16 bits as you've demonstrated a workaround. Any client/server needing to interpret more bits than the 16 just needs to also understand extended transmission bit flags. I don't that that's an unreasonable requirement, so I don't actually buy the prioritisation argument. So this seems to me like a good argument we should proceed how you originally suggested and Wouter should be less worried about your flag-burning activities :-) > Possibility 2: Leverage the extension-info branch: Add a new > NBD_INFO_INIT_ZERO information datum. NBD_BLOCK_INFO already exists to I think you mean NBD_INFO_BLOCK_SIZE? 
> let server and client exchange as much per-export information as they
> want during handshake without burning any new transmission flags, and
> with a specification that gracefully ignores unknown requests from the
> client and unknown advertisements from the server. There's the drawback
> that the server can't advertise the known-zero-init optimization to
> clients that don't use NBD_OPT_GO, but that's not too bad (especially if
> qemu is the first client with code to actually make use of the
> optimization, since I am already trying to get qemu to be one of the
> first adopters of NBD_OPT_GO).

I think you are suggesting adding NBD_INFO_SOMETHINGELSE, which could
carry extended transmission flags. Fair enough.

Possibility #3: we modify NBD_INFO_EXPORT (which is still experimental)
*now* to carry an extended, 64-bit transmission flags field. This is
where transmission flags SHOULD go, and just as we'll run out of them in
NBD_OPT_EXPORT_NAME we'll also run out of the same 16 flags in
NBD_OPT_GO. So it seems a good idea to free us from that limitation now.
And all clients (one hopes) will use NBD_OPT_GO eventually.

This has the minor disadvantage of breaking clients that rely on the
current state of the experimental extension, but that's why it's
experimental. Newer clients can talk to older servers if they are so
inclined by checking the length field on the option (which will extend
from 12 to 16).

> We'll probably have to eventually use something along the lines of
> possibility 1 for more transmission flags for some other reason down the
> road (since we have lately been adding one flag per new command/command
> group as we add extensions), but I agree that it is a bit heavy for now.
> So it sounds like I should rework the proposal along the lines of
> possibility 2, which also implies that we should get extension-info
> ready for promotion to current.
> And that means that my current proposal
> to the extension-write-zeroes branch is probably not ideal, but it
> reopens the question of whether extension-write-zeroes as it currently
> stands is ready for promotion to stable now that qemu 2.8 implements it.

My view is:

#0 (your original proposal) is actually fine.

#1 seems too heavy.

#2 also seems pretty heavyweight - adding a whole new info command for
one bit.

#3 is pretty simple, but carries the disadvantage that you won't be able
to provide a reference implementation without also putting NBD_OPT_GO
support into the reference implementation. Oh hang on, perhaps that's an
advantage :-)

So I'd either go with #0 or #3.

-- 
Alex Bligh
Hi Eric,

On Wed, Dec 14, 2016 at 10:37:43AM -0600, Eric Blake wrote:
> On 12/14/2016 02:22 AM, Wouter Verhelst wrote:
> > Hi Eric,
> >
> > On Tue, Dec 13, 2016 at 04:36:08PM -0600, Eric Blake wrote:
> >> On 12/13/2016 06:18 AM, Wouter Verhelst wrote:
> >>> On Tue, Dec 13, 2016 at 08:38:12AM +0100, Kevin Wolf wrote:
> >>>> Am 12.12.2016 um 19:12 hat Wouter Verhelst geschrieben:
> >>>>> I'm not opposed to this proposal, per se, but there seems to be some
> >>>>> disagreement (by Kevin, for instance) on whether this extension is at
> >>>>> all useful.
> >>>>
> >>>> FWIW, I'm not opposed to adding the flag. I don't think it can hurt, but
> >>>> I wanted to ask the question so that we don't end up adding a feature
> >>>> that noone actually uses. Ultimately it's your call as a maintainer
> >>>> anyway how conservative you want to be with spec additions.
> >>>
> >>> I'm not opposed either, but I do agree with you that we shouldn't add
> >>> such a feature if it doesn't end up getting used. Especially so if it
> >>> burns a flag in the (16-bit) "transmission flags" field, where space is
> >>> at a premium.
> >>
> >> No, it is NOT a "transmission flag", as it is a per-export property
> >> (where we currently have 64 bits).
> >
> > That may be what you meant, but the patch you sent uses a flag in the
> > transmission flags field.
> >
> > If you meant to use something else, you'll have to clarify what your
> > patch should have been like ;-)
>
> /me goes back and re-reads spec - I shouldn't reply to mails purely off
> of memory...

Well, I tend to do that too, so no worries :-)

> Okay, I'm crossing terms here. "Transmission flags" ARE the per-export
> flags - and are sent in response to NBD_OPT_EXPORT_NAME or NBD_OPT_INFO
> or NBD_OPT_GO. And you are right that we only have 16 bits in the
> current spec, but that they can differ between exports (case in point:
> NBD_FLAG_READ_ONLY).
> So my proposal of bit 10 as NBD_FLAG_INIT_ZEROES
> is potentially in the right place - it is a per-export property (a
> server may support multiple named exports for the client to choose from,
> of which only a subset are known to be all zero content at the time of
> export). But your argument that 16 bits is at a premium may be worth
> considering an alternative representation for the same information.

Right. The reason we "need" transmission flags (and I grant you that's a
bit weak of a reason) is that they get passed, unmodified, to the kernel
in case of the Linux implementation, by way of ioctl(NBD_SET_FLAGS). The
advantage to that is that we can easily teach the kernel about new
features by simply using transmission flags; if the new feature is a
boolean (i.e., either it can be used or it cannot, but the server won't
care if the client doesn't use it), then using a flag is the easiest way
to accomplish that. If it requires more (e.g., in the case of the block
sizes thing), then a flag doesn't buy us much.

When I wrote up the newstyle negotiation, I split the flags field (which
already existed, and which was already passed into the kernel in that
way) into two. Admittedly that was a mistake, but it is what it is, and
we can't extend the size of the transmission flags field except by
backwards-incompatible changes.

At any rate though, if the flags ever get used up, I'm sure we could
reserve flag 15 (that is, 1 << 15) as a marker to say that there's more,
and then the kernel could add another ioctl() to add a second flags
field, which the user-space client could ask from the server by way of
another NBD_OPT_* message (and it could discover that it exists by way
of NBD_OPT_INFO).

All that to say that I think a flag is really only useful if it marks
support in the server for some feature which would be backwards
incompatible if the client were to just use it without further
consideration.
If that isn't the case, then some option haggling is probably the better
choice.

In case of "this device is initialized with zeroes", I'm frankly not
sure what good that information would do to the kernel; I don't think
it's anything which could allow it to optimize (or not) some things.
Perhaps at the start, yes, but as soon as it's written *something*, it'd
be useless information (unless it were to keep a bitmap of some sort,
but by that time you're optimizing for a corner case, which seems
silly). This, too, is another reason why I don't think your proposal as
stated is very useful.

> Possibility 1: when we run out of our current 16 transmission flags,
> we'll need a way to extend the protocol to support more than that. To
> do that, we'd probably want a new Handshake flag
> NBD_FLAG_LARGE_TRANSMISSION_FLAGS sent by the server, and which must be
> answered by the client sending a new Client flag
> NBD_FLAG_C_LARGE_TRANSMISSION_FLAGS, so that both sides know that all
> subsequent places in the protocol that send transmission flags will now
> send 64 bits instead of 16 bits (old-style negotiation cannot benefit
> from this, and is permanently stuck at 16 bits, but we discourage
> old-style negotiation).

That could work too, yes.

[...]

> Possibility 2: Leverage the extension-info branch: Add a new
> NBD_INFO_INIT_ZERO information datum.

I think this is preferable, really.

[...]

> We'll probably have to eventually use something along the lines of
> possibility 1 for more transmission flags for some other reason down the
> road (since we have lately been adding one flag per new command/command
> group as we add extensions), but I agree that it is a bit heavy for now.
> So it sounds like I should rework the proposal along the lines of
> possibility 2, which also implies that we should get extension-info
> ready for promotion to current.

Indeed.
> And that means that my current proposal to the extension-write-zeroes
> branch is probably not ideal, but it reopens the question of whether
> extension-write-zeroes as it currently stands is ready for promotion
> to stable now that qemu 2.8 implements it.

Right.
diff --git a/doc/proto.md b/doc/proto.md
index afe71fc..7e4ec7f 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -697,6 +697,11 @@ The field has the following format:
   the export.
 - bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`: defined by the experimental
   `BLOCK_STATUS` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-blockstatus/doc/proto.md).
+- bit 10, `NBD_FLAG_INIT_ZEROES`: Indicates that the server guarantees
+  that at the time transmission phase begins, all offsets within the
+  export read as zero bytes. Clients MAY use this information to
+  avoid writing to sections of the export that should still read as
+  zero after the client is done writing.
 
 Clients SHOULD ignore unknown flags.
While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
team discovered that it is useful if a server can advertise
whether an export is in a known-all-zeroes state at the time
the client connects.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 doc/proto.md | 5 +++++
 1 file changed, 5 insertions(+)

This replaces the following qemu patch attempt:
https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00357.html
which tried to add NBD_CMD_HAS_ZERO_INIT with poor semantics. The
semantics in this proposal should be much better.

Patch is to the merge of the master branch and the
extension-write-zeroes branch. By the way, qemu 2.8 is due
to be released "real soon now", and implements NBD_CMD_WRITE_ZEROES,
so maybe it is time to consider promoting the extension-write-zeroes
branch into master.