diff mbox

[1/2] NBD proto: add WRITE_ZEROES extension

Message ID 1458742562-30624-2-git-send-email-den@openvz.org (mailing list archive)
State New, archived
Headers show

Commit Message

Denis V. Lunev March 23, 2016, 2:16 p.m. UTC
From: Pavel Borzenkov <pborzenkov@virtuozzo.com>

There exist some cases when a client knows that the data it is going to
write is all zeroes. Such cases include mirroring or backing up a device
implemented by a sparse file.

With current NBD command set, the client has to issue NBD_CMD_WRITE
command with zeroed payload and transfer these zero bytes through the
wire. The server has to write the data onto disk, effectively denying
the sparseness.

To remedy this, the patch adds WRITE_ZEROES extension with one new
NBD_CMD_WRITE_ZEROES command.

Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Wouter Verhelst <w@uter.be>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
---
 doc/proto.md | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

Comments

Eric Blake March 23, 2016, 3:14 p.m. UTC | #1
On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> 
> There exist some cases when a client knows that the data it is going to
> write is all zeroes. Such cases include mirroring or backing up a device
> implemented by a sparse file.
> 
> With current NBD command set, the client has to issue NBD_CMD_WRITE
> command with zeroed payload and transfer these zero bytes through the
> wire. The server has to write the data onto disk, effectively denying
> the sparseness.
> 
> To remedy this, the patch adds WRITE_ZEROES extension with one new
> NBD_CMD_WRITE_ZEROES command.
> 
> Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  doc/proto.md | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)
> 

>  
> +* `NBD_CMD_WRITE_ZEROES` (6)
> +
> +    Defined by the experimental `WRITE_ZEROES` extension; see below.

If this patch goes in to the NBD sources, the extension is not
experimental any more, right?  Shouldn't the patch be written under the
assumption that it will be the final form of the text, even though the
feature is experimental until then? [1]

> +### `WRITE_ZEROES` extension
> +
> +There exist some cases when a client knows that the data it is going to write
> +is all zeroes. Such cases include mirroring or backing up a device implemented
> +by a sparse file. With current NBD command set, the client has to issue
> +`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
> +through the wire. The server has to write the data onto disk, effectively
> +denying the sparseness.

s/denying/losing/ ?

> +
> +To remedy this, a `WRITE_ZEROES` extension is envisioned. This extension adds

s/is envisioned/exists/ - again, under the argument that once it is part
of this document, it is no longer experimental.

/me goes and reads the actual proto.md file, and light bulb turns on...

[1] Oh, you ARE adding this to the "Experimental extensions" section of
the document, so your wording IS correct.  I guess the idea is that we
write up the documentation in the experimental section, tweak qemu to
implement it both as NBD client and as NBD server (since qemu has code
that can serve in both positions), see how well it worked, and THEN do a
followup patch to proto.md to move the text into the non-experimental
section, along with any tweaks learned during the qemu patches.

> +one new command with two command flags.
> +
> +* `NBD_CMD_WRITE_ZEROES` (6)
> +
> +    A write request with no payload. Length and offset define the location
> +    and amount of data to be zeroed.
> +
> +    The server MUST zero out the data on disk, and then send the reply
> +    message. The server MAY send the reply message before the data has
> +    reached permanent storage.
> +
> +    If the `NBD_FLAG_SEND_FUA` flag ("Force Unit Access") was set in the
> +    export flags field, the client MAY set the flag `NBD_CMD_FLAG_FUA` (bit 0)
> +    in the command flags field. If this flag was set, the server MUST NOT send
> +    the reply until it has ensured that the newly-zeroed data has reached
> +    permanent storage.

Do we want to add:

The server SHOULD return EINVAL if the client set NBD_CMD_FLAG_FUA but
the export flags did not include NBD_FLAG_SEND_FUA.

> +
> +    If the flag `NBD_CMD_FLAG_MAY_TRIM` (bit 1) was set by the client in the
> +    command flags field, the server MAY use trimming to zero out the area,
> +    but it MUST ensure that the data reads back as zero.

Bug in the existing spec: The constant NBD_CMD_FLAG_FUA is mentioned,
but never defined with a fixed value.  Your text above defines it to
'bit 0' in the command flags field - is that correct?  If so, should we
add a section to the document that lists the bit values of all supported
command flags?

Meanwhile, your proposed text hardcodes NBD_CMD_FLAG_MAY_TRIM to 'bit
1'; that might also be worth adding into the same new section of the
document documenting all supported command flags.

Do we want to require that the server has negotiated the
NBD_FLAG_SEND_TRIM export flag prior to allowing the
NBD_CMD_FLAG_MAY_TRIM flag in this command?

Possibly-related bug in the existing spec: Should the text of the
NBD_CMD_TRIM (4) command mention the desired server behavior if the
client sends NBD_CMD_TRIM even though NBD_FLAG_SEND_TRIM was not
negotiated in the export flags?  Similarly, should the text of the
NBD_CMD_FLUSH () command mention the desired server behavior if the
client sends NBD_CMD_FLUSH even though NBD_FLAG_SEND_FLUSH was not
negotiated in the export flags?  At least NBD_CMD_FLUSH recommended that
the client must not send the command if the feature was not negotiated.

> +
> +    If an error occurs, the server SHOULD set the appropriate error code
> +    in the error field. The server MAY then close the connection.
> +
> +The server SHOULD return `ENOSPC` if it receives a write zeroes request
> +including one or more sectors beyond the size of the device. It SHOULD
> +return `EPERM` if it receives a write zeroes request on a read-only export.

Should we add a paragraph stating that the client MUST NOT send
NBD_CMD_WRITE_ZEROES if NBD_FLAG_SEND_WRITE_ZEROES was not negotiated in
the export options?  Similarly, should we suggest that the server reply
with EINVAL if it knows about the command, but the client issues the
command in spite of not negotiating it?  Should we enhance the
documentation in the "Error values" heading to mention that EINVAL
should be used in general for any client command not expected by the server?

> +
>  ## About this file
>  
>  This file tries to document the NBD protocol as it is currently
>
Wouter Verhelst March 23, 2016, 5:21 p.m. UTC | #2
Hi,

On Wed, Mar 23, 2016 at 05:16:01PM +0300, Denis V. Lunev wrote:
> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> 
> There exist some cases when a client knows that the data it is going to
> write is all zeroes. Such cases include mirroring or backing up a device
> implemented by a sparse file.
> 
> With current NBD command set, the client has to issue NBD_CMD_WRITE
> command with zeroed payload and transfer these zero bytes through the
> wire. The server has to write the data onto disk, effectively denying
> the sparseness.
> 
> To remedy this, the patch adds WRITE_ZEROES extension with one new
> NBD_CMD_WRITE_ZEROES command.
> 
> Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  doc/proto.md | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index 463ef8a..cda213c 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -241,6 +241,8 @@ immediately after the global flags field in oldstyle negotiation:
>    schedule I/O accesses as for a rotational medium
>  - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
>    `NBD_CMD_TRIM` commands
> +- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
> +  supports `NBD_CMD_WRITE_ZEROES` commands
>  
>  ##### Client flags
>  
> @@ -471,6 +473,10 @@ The following request types exist:
>      about the contents of the export affected by this command, until
>      overwriting it again with `NBD_CMD_WRITE`.
>  
> +* `NBD_CMD_WRITE_ZEROES` (6)
> +
> +    Defined by the experimental `WRITE_ZEROES` extension; see below.
> +
>  * Other requests
>  
>      Some third-party implementations may require additional protocol
> @@ -594,6 +600,44 @@ option reply type.
>        message if they do not also send it as a reply to the
>        `NBD_OPT_SELECT` message.
>  
> +### `WRITE_ZEROES` extension
> +
> +There exist some cases when a client knows that the data it is going to write
> +is all zeroes. Such cases include mirroring or backing up a device implemented
> +by a sparse file. With current NBD command set, the client has to issue
> +`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
> +through the wire. The server has to write the data onto disk, effectively
> +denying the sparseness.
> +
> +To remedy this, a `WRITE_ZEROES` extension is envisioned. This extension adds
> +one new command with two command flags.
> +
> +* `NBD_CMD_WRITE_ZEROES` (6)
> +
> +    A write request with no payload. Length and offset define the location
> +    and amount of data to be zeroed.
> +
> +    The server MUST zero out the data on disk, and then send the reply
> +    message. The server MAY send the reply message before the data has
> +    reached permanent storage.
> +
> +    If the `NBD_FLAG_SEND_FUA` flag ("Force Unit Access") was set in the
> +    export flags field, the client MAY set the flag `NBD_CMD_FLAG_FUA` (bit 0)
> +    in the command flags field. If this flag was set, the server MUST NOT send
> +    the reply until it has ensured that the newly-zeroed data has reached
> +    permanent storage.
> +
> +    If the flag `NBD_CMD_FLAG_MAY_TRIM` (bit 1) was set by the client in the
> +    command flags field, the server MAY use trimming to zero out the area,
> +    but it MUST ensure that the data reads back as zero.
> +
> +    If an error occurs, the server SHOULD set the appropriate error code
> +    in the error field. The server MAY then close the connection.
> +
> +The server SHOULD return `ENOSPC` if it receives a write zeroes request
> +including one or more sectors beyond the size of the device. It SHOULD
> +return `EPERM` if it receives a write zeroes request on a read-only export.
> +

So, the semantics of your proposed WRITE_ZEROES are exactly the same as
the WRITE command, except that no payload is sent?

In that case, I think it's slightly more sensible if we don't add a new
command, but instead just have an NBD_CMD_FLAG_ZEROES added to the WRITE
command instead. After all, they're going to be (mostly) the same
anyway.

Did you propose a separate command for a specific reason that I'm
missing (or forgetting), or is that just an oversight?
Wouter Verhelst March 23, 2016, 5:40 p.m. UTC | #3
On Wed, Mar 23, 2016 at 09:14:10AM -0600, Eric Blake wrote:
> On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
[...]
> [1] Oh, you ARE adding this to the "Experimental extensions" section of
> the document, so your wording IS correct.  I guess the idea is that we
> write up the documentation in the experimental section, tweak qemu to
> implement it both as NBD client and as NBD server (since qemu has code
> that can serve in both positions), see how well it worked, and THEN do a
> followup patch to proto.md to move the text into the non-experimental
> section, along with any tweaks learned during the qemu patches.

That's the way I accept specification changes now, yes: define an
experimental spec first, wait for a first implementation (whether that's
qemu or the reference implementation is not as relevant), and only then
move it to the normative part of the spec.

> > +one new command with two command flags.
> > +
> > +* `NBD_CMD_WRITE_ZEROES` (6)
> > +
> > +    A write request with no payload. Length and offset define the location
> > +    and amount of data to be zeroed.
> > +
> > +    The server MUST zero out the data on disk, and then send the reply
> > +    message. The server MAY send the reply message before the data has
> > +    reached permanent storage.
> > +
> > +    If the `NBD_FLAG_SEND_FUA` flag ("Force Unit Access") was set in the
> > +    export flags field, the client MAY set the flag `NBD_CMD_FLAG_FUA` (bit 0)
> > +    in the command flags field. If this flag was set, the server MUST NOT send
> > +    the reply until it has ensured that the newly-zeroed data has reached
> > +    permanent storage.
> 
> Do we want to add:
> 
> The server SHOULD return EINVAL if the client set NBD_CMD_FLAG_FUA but
> the export flags did not include NBD_FLAG_SEND_FUA.

Well, if the export flags didn't include NBD_FLAG_SEND_FUA, that means
the server may not implement that flag, and hence the server may not
know what to do with it.

It's probably a good idea to send a particular error (EINVAL sounds
good, and is what the current reference nbd-server implementation
sends[1]) for commands or flags the server implementation doesn't know
about, but the spec doesn't currently say that explicitly.

[1] actually, the current server sends EINVAL for commands the server
doesn't know about, but silently ignores flags the server doesn't know
about.

> > +    If the flag `NBD_CMD_FLAG_MAY_TRIM` (bit 1) was set by the client in the
> > +    command flags field, the server MAY use trimming to zero out the area,
> > +    but it MUST ensure that the data reads back as zero.
> 
> Bug in the existing spec: The constant NBD_CMD_FLAG_FUA is mentioned,
> but never defined with a fixed value.  Your text above defines it to
> 'bit 0' in the command flags field - is that correct?

(checks code) Yes, it is.

> If so, should we add a section to the document that lists the bit
> values of all supported command flags?

Currently that's only the FUA flag, but yes, I'll do so.

> Meanwhile, your proposed text hardcodes NBD_CMD_FLAG_MAY_TRIM to 'bit
> 1'; that might also be worth adding into the same new section of the
> document documenting all supported command flags.
> 
> Do we want to require that the server has negotiated the
> NBD_FLAG_SEND_TRIM export flag prior to allowing the
> NBD_CMD_FLAG_MAY_TRIM flag in this command?
> 
> Possibly-related bug in the existing spec: Should the text of the
> NBD_CMD_TRIM (4) command mention the desired server behavior if the
> client sends NBD_CMD_TRIM even though NBD_FLAG_SEND_TRIM was not
> negotiated in the export flags?

Not sure that's a good idea. The export flag is there to say "I support
it". If the server doesn't support a feature, the client shouldn't use
that feature. If the client *does* use it, it ends up in undefined
behaviour-land (because the server may be from whatever source).

> Similarly, should the text of the NBD_CMD_FLUSH () command mention the
> desired server behavior if the client sends NBD_CMD_FLUSH even though
> NBD_FLAG_SEND_FLUSH was not negotiated in the export flags?  At least
> NBD_CMD_FLUSH recommended that the client must not send the command if
> the feature was not negotiated.

I suppose it would make sense to make the section on NBD_CMD_TRIM have a
similar notice, yes.

> > +    If an error occurs, the server SHOULD set the appropriate error code
> > +    in the error field. The server MAY then close the connection.
> > +
> > +The server SHOULD return `ENOSPC` if it receives a write zeroes request
> > +including one or more sectors beyond the size of the device. It SHOULD
> > +return `EPERM` if it receives a write zeroes request on a read-only export.
> 
> Should we add a paragraph stating that the client MUST NOT send
> NBD_CMD_WRITE_ZEROES if NBD_FLAG_SEND_WRITE_ZEROES was not negotiated in
> the export options?  Similarly, should we suggest that the server reply
> with EINVAL if it knows about the command, but the client issues the
> command in spite of not negotiating it?  Should we enhance the
> documentation in the "Error values" heading to mention that EINVAL
> should be used in general for any client command not expected by the server?

Yes, probably. As above, the current nbd-server implementation does so,
and I'll make that explicit.
Pavel Borzenkov March 24, 2016, 7:16 a.m. UTC | #4
On Wed, Mar 23, 2016 at 09:14:10AM -0600, Eric Blake wrote:
> On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> > From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > 
> > There exist some cases when a client knows that the data it is going to
> > write is all zeroes. Such cases include mirroring or backing up a device
> > implemented by a sparse file.
> > 
> > With current NBD command set, the client has to issue NBD_CMD_WRITE
> > command with zeroed payload and transfer these zero bytes through the
> > wire. The server has to write the data onto disk, effectively denying
> > the sparseness.
> > 
> > To remedy this, the patch adds WRITE_ZEROES extension with one new
> > NBD_CMD_WRITE_ZEROES command.
> > 
> > Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > CC: Wouter Verhelst <w@uter.be>
> > CC: Paolo Bonzini <pbonzini@redhat.com>
> > CC: Kevin Wolf <kwolf@redhat.com>
> > CC: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  doc/proto.md | 44 ++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 44 insertions(+)
> > 
> 
> >  
> > +* `NBD_CMD_WRITE_ZEROES` (6)
> > +
> > +    Defined by the experimental `WRITE_ZEROES` extension; see below.
> 
> If this patch goes in to the NBD sources, the extension is not
> experimental any more, right?  Shouldn't the patch be written under the
> assumption that it will be the final form of the text, even though the
> feature is experimental until then? [1]
> 
> > +### `WRITE_ZEROES` extension
> > +
> > +There exist some cases when a client knows that the data it is going to write
> > +is all zeroes. Such cases include mirroring or backing up a device implemented
> > +by a sparse file. With current NBD command set, the client has to issue
> > +`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
> > +through the wire. The server has to write the data onto disk, effectively
> > +denying the sparseness.
> 
> s/denying/losing/ ?
> 
> > +
> > +To remedy this, a `WRITE_ZEROES` extension is envisioned. This extension adds
> 
> s/is envisioned/exists/ - again, under the argument that once it is part
> of this document, it is no longer experimental.
> 
> /me goes and reads the actual proto.md file, and light bulb turns on...
> 
> [1] Oh, you ARE adding this to the "Experimental extensions" section of
> the document, so your wording IS correct.  I guess the idea is that we
> write up the documentation in the experimental section, tweak qemu to
> implement it both as NBD client and as NBD server (since qemu has code
> that can serve in both positions), see how well it worked, and THEN do a
> followup patch to proto.md to move the text into the non-experimental
> section, along with any tweaks learned during the qemu patches.

Yes, that is the plan.

> > +one new command with two command flags.
> > +
> > +* `NBD_CMD_WRITE_ZEROES` (6)
> > +
> > +    A write request with no payload. Length and offset define the location
> > +    and amount of data to be zeroed.
> > +
> > +    The server MUST zero out the data on disk, and then send the reply
> > +    message. The server MAY send the reply message before the data has
> > +    reached permanent storage.
> > +
> > +    If the `NBD_FLAG_SEND_FUA` flag ("Force Unit Access") was set in the
> > +    export flags field, the client MAY set the flag `NBD_CMD_FLAG_FUA` (bit 0)
> > +    in the command flags field. If this flag was set, the server MUST NOT send
> > +    the reply until it has ensured that the newly-zeroed data has reached
> > +    permanent storage.
> 
> Do we want to add:
> 
> The server SHOULD return EINVAL if the client set NBD_CMD_FLAG_FUA but
> the export flags did not include NBD_FLAG_SEND_FUA.

Looks like a good idea. Needs to be added here and in NBD_CMD_WRITE
command description.

> 
> > +
> > +    If the flag `NBD_CMD_FLAG_MAY_TRIM` (bit 1) was set by the client in the
> > +    command flags field, the server MAY use trimming to zero out the area,
> > +    but it MUST ensure that the data reads back as zero.
> 
> Bug in the existing spec: The constant NBD_CMD_FLAG_FUA is mentioned,
> but never defined with a fixed value.  Your text above defines it to
> 'bit 0' in the command flags field - is that correct?  If so, should we
> add a section to the document that lists the bit values of all supported
> command flags?

Yes, this is correct (at least this is how it's used in nbd and
QEMU).

My understanding of the NBD specification is that the 'command flags'
field contains per-command flags, thus their values for each command may
coincide. So I think it's better to describe flags supported by each
command in respective command's section, instead of in one global
"Command flags" section.

> 
> Meanwhile, your proposed text hardcodes NBD_CMD_FLAG_MAY_TRIM to 'bit
> 1'; that might also be worth adding into the same new section of the
> document documenting all supported command flags.
> 
> Do we want to require that the server has negotiated the
> NBD_FLAG_SEND_TRIM export flag prior to allowing the
> NBD_CMD_FLAG_MAY_TRIM flag in this command?

I don't think there is a need for such a requirement. Since
NBD_CMD_FLAG_MAY_TRIM flag is merely a suggestion, it can be safely
ignored by the server if it doesn't want to (or can't) use TRIM.

> 
> Possibly-related bug in the existing spec: Should the text of the
> NBD_CMD_TRIM (4) command mention the desired server behavior if the
> client sends NBD_CMD_TRIM even though NBD_FLAG_SEND_TRIM was not
> negotiated in the export flags?  Similarly, should the text of the
> NBD_CMD_FLUSH () command mention the desired server behavior if the
> client sends NBD_CMD_FLUSH even though NBD_FLAG_SEND_FLUSH was not
> negotiated in the export flags?  At least NBD_CMD_FLUSH recommended that
> the client must not send the command if the feature was not negotiated.

Agree, better to mention this.

> 
> > +
> > +    If an error occurs, the server SHOULD set the appropriate error code
> > +    in the error field. The server MAY then close the connection.
> > +
> > +The server SHOULD return `ENOSPC` if it receives a write zeroes request
> > +including one or more sectors beyond the size of the device. It SHOULD
> > +return `EPERM` if it receives a write zeroes request on a read-only export.
> 
> Should we add a paragraph stating that the client MUST NOT send
> NBD_CMD_WRITE_ZEROES if NBD_FLAG_SEND_WRITE_ZEROES was not negotiated in
> the export options?  Similarly, should we suggest that the server reply
> with EINVAL if it knows about the command, but the client issues the
> command in spite of not negotiating it?  Should we enhance the
> documentation in the "Error values" heading to mention that EINVAL
> should be used in general for any client command not expected by the server?

Agree.

So if no one objects, I'll send a patch correcting current spec
ambiguities and than patches with new proposed commands.

> 
> > +
> >  ## About this file
> >  
> >  This file tries to document the NBD protocol as it is currently
> > 
> 
> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
>
Wouter Verhelst March 24, 2016, 7:36 a.m. UTC | #5
On Thu, Mar 24, 2016 at 10:16:19AM +0300, Pavel Borzenkov wrote:
> So if no one objects, I'll send a patch correcting current spec
> ambiguities and than patches with new proposed commands.

Yes please. Thank you.
Pavel Borzenkov March 24, 2016, 7:57 a.m. UTC | #6
On Wed, Mar 23, 2016 at 06:21:16PM +0100, Wouter Verhelst wrote:
> Hi,
> 
> On Wed, Mar 23, 2016 at 05:16:01PM +0300, Denis V. Lunev wrote:
> > From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > 
> > There exist some cases when a client knows that the data it is going to
> > write is all zeroes. Such cases include mirroring or backing up a device
> > implemented by a sparse file.
> > 
> > With current NBD command set, the client has to issue NBD_CMD_WRITE
> > command with zeroed payload and transfer these zero bytes through the
> > wire. The server has to write the data onto disk, effectively denying
> > the sparseness.
> > 
> > To remedy this, the patch adds WRITE_ZEROES extension with one new
> > NBD_CMD_WRITE_ZEROES command.
> > 
> > Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > CC: Wouter Verhelst <w@uter.be>
> > CC: Paolo Bonzini <pbonzini@redhat.com>
> > CC: Kevin Wolf <kwolf@redhat.com>
> > CC: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  doc/proto.md | 44 ++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 44 insertions(+)
> > 
> > diff --git a/doc/proto.md b/doc/proto.md
> > index 463ef8a..cda213c 100644
> > --- a/doc/proto.md
> > +++ b/doc/proto.md
> > @@ -241,6 +241,8 @@ immediately after the global flags field in oldstyle negotiation:
> >    schedule I/O accesses as for a rotational medium
> >  - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
> >    `NBD_CMD_TRIM` commands
> > +- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
> > +  supports `NBD_CMD_WRITE_ZEROES` commands
> >  
> >  ##### Client flags
> >  
> > @@ -471,6 +473,10 @@ The following request types exist:
> >      about the contents of the export affected by this command, until
> >      overwriting it again with `NBD_CMD_WRITE`.
> >  
> > +* `NBD_CMD_WRITE_ZEROES` (6)
> > +
> > +    Defined by the experimental `WRITE_ZEROES` extension; see below.
> > +
> >  * Other requests
> >  
> >      Some third-party implementations may require additional protocol
> > @@ -594,6 +600,44 @@ option reply type.
> >        message if they do not also send it as a reply to the
> >        `NBD_OPT_SELECT` message.
> >  
> > +### `WRITE_ZEROES` extension
> > +
> > +There exist some cases when a client knows that the data it is going to write
> > +is all zeroes. Such cases include mirroring or backing up a device implemented
> > +by a sparse file. With current NBD command set, the client has to issue
> > +`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
> > +through the wire. The server has to write the data onto disk, effectively
> > +denying the sparseness.
> > +
> > +To remedy this, a `WRITE_ZEROES` extension is envisioned. This extension adds
> > +one new command with two command flags.
> > +
> > +* `NBD_CMD_WRITE_ZEROES` (6)
> > +
> > +    A write request with no payload. Length and offset define the location
> > +    and amount of data to be zeroed.
> > +
> > +    The server MUST zero out the data on disk, and then send the reply
> > +    message. The server MAY send the reply message before the data has
> > +    reached permanent storage.
> > +
> > +    If the `NBD_FLAG_SEND_FUA` flag ("Force Unit Access") was set in the
> > +    export flags field, the client MAY set the flag `NBD_CMD_FLAG_FUA` (bit 0)
> > +    in the command flags field. If this flag was set, the server MUST NOT send
> > +    the reply until it has ensured that the newly-zeroed data has reached
> > +    permanent storage.
> > +
> > +    If the flag `NBD_CMD_FLAG_MAY_TRIM` (bit 1) was set by the client in the
> > +    command flags field, the server MAY use trimming to zero out the area,
> > +    but it MUST ensure that the data reads back as zero.
> > +
> > +    If an error occurs, the server SHOULD set the appropriate error code
> > +    in the error field. The server MAY then close the connection.
> > +
> > +The server SHOULD return `ENOSPC` if it receives a write zeroes request
> > +including one or more sectors beyond the size of the device. It SHOULD
> > +return `EPERM` if it receives a write zeroes request on a read-only export.
> > +
> 
> So, the semantics of your proposed WRITE_ZEROES are exactly the same as
> the WRITE command, except that no payload is sent?
> 
> In that case, I think it's slightly more sensible if we don't add a new
> command, but instead just have an NBD_CMD_FLAG_ZEROES added to the WRITE
> command instead. After all, they're going to be (mostly) the same
> anyway.
> 
> Did you propose a separate command for a specific reason that I'm
> missing (or forgetting), or is that just an oversight?

No, there is no specific reason. Looks like NBD_CMD_FLAG_ZEROES fits the
spec and implementations nicely. So I'll rewrite the extension and add
the flag instead of the whole command.

> 
> -- 
> < ron> I mean, the main *practical* problem with C++, is there's like a dozen
>        people in the world who think they really understand all of its rules,
>        and pretty much all of them are just lying to themselves too.
>  -- #debian-devel, OFTC, 2016-02-12
> 
> ------------------------------------------------------------------------------
> Transform Data into Opportunity.
> Accelerate data analysis in your applications with
> Intel Data Analytics Acceleration Library.
> Click to learn more.
> http://pubads.g.doubleclick.net/gampad/clk?id=278785351&iu=/4140
> _______________________________________________
> Nbd-general mailing list
> Nbd-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nbd-general
Wouter Verhelst March 24, 2016, 8:26 a.m. UTC | #7
On Thu, Mar 24, 2016 at 10:57:06AM +0300, Pavel Borzenkov wrote:
> On Wed, Mar 23, 2016 at 06:21:16PM +0100, Wouter Verhelst wrote:
> > So, the semantics of your proposed WRITE_ZEROES are exactly the same as
> > the WRITE command, except that no payload is sent?
> > 
> > In that case, I think it's slightly more sensible if we don't add a new
> > command, but instead just have an NBD_CMD_FLAG_ZEROES added to the WRITE
> > command instead. After all, they're going to be (mostly) the same
> > anyway.
> > 
> > Did you propose a separate command for a specific reason that I'm
> > missing (or forgetting), or is that just an oversight?
> 
> No, there is no specific reason. Looks like NBD_CMD_FLAG_ZEROES fits the
> spec and implementations nicely. So I'll rewrite the extension and add
> the flag instead of the whole command.

Actually, having given this some more thought...

There is at least one server-side implementation of nbd (mine) which
silently ignores flags it doesn't know about. This isn't a problem for
non-critical flags, but it could be a problem for a flag like this. Of
course, a client shouldn't send a flag to a server which that server
hasn't heard of, but mistakes do happen.

Do we want to keep that in mind? If so, we might want to keep it as a
separate command after all.

OTOH, it could be said that silently ignoring unknown messages is a bug.
I should probably just fix my implementation instead.
Pavel Borzenkov March 24, 2016, 11:35 a.m. UTC | #8
On Thu, Mar 24, 2016 at 09:26:41AM +0100, Wouter Verhelst wrote:
> On Thu, Mar 24, 2016 at 10:57:06AM +0300, Pavel Borzenkov wrote:
> > On Wed, Mar 23, 2016 at 06:21:16PM +0100, Wouter Verhelst wrote:
> > > So, the semantics of your proposed WRITE_ZEROES are exactly the same as
> > > the WRITE command, except that no payload is sent?
> > > 
> > > In that case, I think it's slightly more sensible if we don't add a new
> > > command, but instead just have an NBD_CMD_FLAG_ZEROES added to the WRITE
> > > command instead. After all, they're going to be (mostly) the same
> > > anyway.
> > > 
> > > Did you propose a separate command for a specific reason that I'm
> > > missing (or forgetting), or is that just an oversight?
> > 
> > No, there is no specific reason. Looks like NBD_CMD_FLAG_ZEROES fits the
> > spec and implementations nicely. So I'll rewrite the extension and add
> > the flag instead of the whole command.
> 
> Actually, having given this some more thought...
> 
> There is at least one server-side implementation of nbd (mine) which
> silently ignores flags it doesn't know about. This isn't a problem for
> non-critical flags, but it could be a problem for a flag like this. Of
> course, a client shouldn't send a flag to a server which that server
> hasn't heard of, but mistakes do happen.
> 
> Do we want to keep that in mind? If so, we might want to keep it as a
> separate command after all.
> 
> OTOH, it could be said that silently ignoring unknown messages is a bug.
> I should probably just fix my implementation instead.

Since we are going to fix such ambiguities in the spec, fixing incorrect
implementations looks like the right thing to do.

But old unfixed versions which ignore unknown flags (or broken
implementation which send ZEROES without negotiating it) will still
exist.

So it looks safer to do it as a separate command. In this case we will
at least receive an error code back.

> 
> -- 
> < ron> I mean, the main *practical* problem with C++, is there's like a dozen
>        people in the world who think they really understand all of its rules,
>        and pretty much all of them are just lying to themselves too.
>  -- #debian-devel, OFTC, 2016-02-12
Paolo Bonzini March 24, 2016, 11:37 a.m. UTC | #9
On 24/03/2016 09:26, Wouter Verhelst wrote:
>> > 
>> > No, there is no specific reason. Looks like NBD_CMD_FLAG_ZEROES fits the
>> > spec and implementations nicely. So I'll rewrite the extension and add
>> > the flag instead of the whole command.
> Actually, having given this some more thought...
> 
> There is at least one server-side implementation of nbd (mine) which
> silently ignores flags it doesn't know about. This isn't a problem for
> non-critical flags, but it could be a problem for a flag like this. Of
> course, a client shouldn't send a flag to a server which that server
> hasn't heard of, but mistakes do happen.
> 
> Do we want to keep that in mind? If so, we might want to keep it as a
> separate command after all.
> 
> OTOH, it could be said that silently ignoring unknown messages is a bug.
> I should probably just fix my implementation instead.

Even if it is a bug, it does suggest that the payload format should not
be changed by flags.  For example ignoring flags is a bug for an NBD
server, but not for a Wireshark protocol dissector.

Paolo
Wouter Verhelst March 24, 2016, 12:31 p.m. UTC | #10
On Thu, Mar 24, 2016 at 12:37:33PM +0100, Paolo Bonzini wrote:
> 
> 
> On 24/03/2016 09:26, Wouter Verhelst wrote:
> >> > 
> >> > No, there is no specific reason. Looks like NBD_CMD_FLAG_ZEROES fits the
> >> > spec and implementations nicely. So I'll rewrite the extension and add
> >> > the flag instead of the whole command.
> > Actually, having given this some more thought...
> > 
> > There is at least one server-side implementation of nbd (mine) which
> > silently ignores flags it doesn't know about. This isn't a problem for
> > non-critical flags, but it could be a problem for a flag like this. Of
> > course, a client shouldn't send a flag to a server which that server
> > hasn't heard of, but mistakes do happen.
> > 
> > Do we want to keep that in mind? If so, we might want to keep it as a
> > separate command after all.
> > 
> > OTOH, it could be said that silently ignoring unknown messages is a bug.
> > I should probably just fix my implementation instead.
> 
> Even if it is a bug, it does suggest that the payload format should not
> be changed by flags.  For example ignoring flags is a bug for an NBD
> server, but not for a Wireshark protocol dissector.

Agreed. Let's make this a different command then, instead.
Eric Blake March 24, 2016, 2:53 p.m. UTC | #11
On 03/24/2016 02:26 AM, Wouter Verhelst wrote:
>> No, there is no specific reason. Looks like NBD_CMD_FLAG_ZEROES fits the
>> spec and implementations nicely. So I'll rewrite the extension and add
>> the flag instead of the whole command.
> 
> Actually, having given this some more thought...
> 
> There is at least one server-side implementation of nbd (mine) which
> silently ignores flags it doesn't know about. This isn't a problem for
> non-critical flags, but it could be a problem for a flag like this. Of
> course, a client shouldn't send a flag to a server which that server
> hasn't heard of, but mistakes do happen.

This is the first time where the presence or absence of a flag would
determine whether the length of the payload should be 0 bytes or len
bytes.  If the server doesn't recognize the flag, then it will try to
read the next length bytes off the wire as the data to write, rather
than as the next command to process.

> 
> Do we want to keep that in mind? If so, we might want to keep it as a
> separate command after all.
> 
> OTOH, it could be said that silently ignoring unknown messages is a bug.
> I should probably just fix my implementation instead.

While I agree that you should fix your implementation, I am strongly in
favor of a new command, so that we can blindly state that the
NBD_CMD_WRITE always sends a payload of length bytes, independent of
flag value.
diff mbox

Patch

diff --git a/doc/proto.md b/doc/proto.md
index 463ef8a..cda213c 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -241,6 +241,8 @@  immediately after the global flags field in oldstyle negotiation:
   schedule I/O accesses as for a rotational medium
 - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
   `NBD_CMD_TRIM` commands
+- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
+  supports `NBD_CMD_WRITE_ZEROES` commands
 
 ##### Client flags
 
@@ -471,6 +473,10 @@  The following request types exist:
     about the contents of the export affected by this command, until
     overwriting it again with `NBD_CMD_WRITE`.
 
+* `NBD_CMD_WRITE_ZEROES` (6)
+
+    Defined by the experimental `WRITE_ZEROES` extension; see below.
+
 * Other requests
 
     Some third-party implementations may require additional protocol
@@ -594,6 +600,44 @@  option reply type.
       message if they do not also send it as a reply to the
       `NBD_OPT_SELECT` message.
 
+### `WRITE_ZEROES` extension
+
+There exist some cases when a client knows that the data it is going to write
+is all zeroes. Such cases include mirroring or backing up a device implemented
+by a sparse file. With current NBD command set, the client has to issue
+`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
+through the wire. The server has to write the data onto disk, effectively
+denying the sparseness.
+
+To remedy this, a `WRITE_ZEROES` extension is envisioned. This extension adds
+one new command with two command flags.
+
+* `NBD_CMD_WRITE_ZEROES` (6)
+
+    A write request with no payload. Length and offset define the location
+    and amount of data to be zeroed.
+
+    The server MUST zero out the data on disk, and then send the reply
+    message. The server MAY send the reply message before the data has
+    reached permanent storage.
+
+    If the `NBD_FLAG_SEND_FUA` flag ("Force Unit Access") was set in the
+    export flags field, the client MAY set the flag `NBD_CMD_FLAG_FUA` (bit 0)
+    in the command flags field. If this flag was set, the server MUST NOT send
+    the reply until it has ensured that the newly-zeroed data has reached
+    permanent storage.
+
+    If the flag `NBD_CMD_FLAG_MAY_TRIM` (bit 1) was set by the client in the
+    command flags field, the server MAY use trimming to zero out the area,
+    but it MUST ensure that the data reads back as zero.
+
+    If an error occurs, the server SHOULD set the appropriate error code
+    in the error field. The server MAY then close the connection.
+
+The server SHOULD return `ENOSPC` if it receives a write zeroes request
+including one or more sectors beyond the size of the device. It SHOULD
+return `EPERM` if it receives a write zeroes request on a read-only export.
+
 ## About this file
 
 This file tries to document the NBD protocol as it is currently