Message ID | 1455732653-3106-1-git-send-email-den@openvz.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 02/17/2016 11:10 AM, Denis V. Lunev wrote: > This patch proposes a new command to reduce the amount of data passed > through the wire when it is known that the data is all zeroes. This > functionality is generally useful for mirroring or backup operations. > > Currently available NBD_CMD_TRIM command can not be used as the > specification explicitely says that "a client MUST NOT make any s/explicitely/explicitly/ > assumptions about the contents of the export affected by this > [NBD_CMD_TRIM] command, until overwriting it again with `NBD_CMD_WRITE`" > > Particular use case could be the following: > > QEMU project uses own implementation of NBD server to transfer data > in between different instances of QEMU. Typically we tranfer VM virtual s/tranfer/transfer/ > disks over this channel. VM virtual disks are sparse and thus the > efficiency of backup and mirroring operations could be improved a lot. > > Signed-off-by: Denis V. Lunev <den@openvz.org> > --- > doc/proto.md | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/doc/proto.md b/doc/proto.md > index 43065b7..c94751a 100644 > --- a/doc/proto.md > +++ b/doc/proto.md > @@ -241,6 +241,8 @@ immediately after the global flags field in oldstyle negotiation: > schedule I/O accesses as for a rotational medium > - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports > `NBD_CMD_TRIM` commands > +- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server > + supports `NBD_CMD_WRITE_ZEROES` commands > > ##### Client flags > > @@ -446,6 +448,11 @@ The following request types exist: > about the contents of the export affected by this command, until > overwriting it again with `NBD_CMD_WRITE`. > > +* `NBD_CMD_WRITE_ZEROES` (6) > + > + A request to write zeroes. The command is functional equivalent of > + the NBD_WRITE_COMMAND but without payload sent through the channel. This lets us push holes during writes. Do we have the converse operation, that is, an easy way to query if a block of data will read as all zeroes, and therefore the client can bypass reading that portion of the disk (in other words, an equivalent to lseek(SEEK_HOLE/SEEK_DATA))?
On 02/17/2016 11:58 PM, Eric Blake wrote: > On 02/17/2016 11:10 AM, Denis V. Lunev wrote: >> This patch proposes a new command to reduce the amount of data passed >> through the wire when it is known that the data is all zeroes. This >> functionality is generally useful for mirroring or backup operations. >> >> Currently available NBD_CMD_TRIM command can not be used as the >> specification explicitely says that "a client MUST NOT make any > s/explicitely/explicitly/ > >> assumptions about the contents of the export affected by this >> [NBD_CMD_TRIM] command, until overwriting it again with `NBD_CMD_WRITE`" >> >> Particular use case could be the following: >> >> QEMU project uses own implementation of NBD server to transfer data >> in between different instances of QEMU. Typically we tranfer VM virtual > s/tranfer/transfer/ > >> disks over this channel. VM virtual disks are sparse and thus the >> efficiency of backup and mirroring operations could be improved a lot. >> >> Signed-off-by: Denis V. Lunev <den@openvz.org> >> --- >> doc/proto.md | 7 +++++++ >> 1 file changed, 7 insertions(+) >> >> diff --git a/doc/proto.md b/doc/proto.md >> index 43065b7..c94751a 100644 >> --- a/doc/proto.md >> +++ b/doc/proto.md >> @@ -241,6 +241,8 @@ immediately after the global flags field in oldstyle negotiation: >> schedule I/O accesses as for a rotational medium >> - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports >> `NBD_CMD_TRIM` commands >> +- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server >> + supports `NBD_CMD_WRITE_ZEROES` commands >> >> ##### Client flags >> >> @@ -446,6 +448,11 @@ The following request types exist: >> about the contents of the export affected by this command, until >> overwriting it again with `NBD_CMD_WRITE`. >> >> +* `NBD_CMD_WRITE_ZEROES` (6) >> + >> + A request to write zeroes. The command is functional equivalent of >> + the NBD_WRITE_COMMAND but without payload sent through the channel. > This lets us push holes during writes. from my point this allows client to apply his policy. For QCOW2 output target the client could skip the block. For RAW file he could decide whether to use UNMAP and produce sparse file or use fallocate. > Do we have the converse > operation, that is, an easy way to query if a block of data will read as > all zeroes, and therefore the client can bypass reading that portion of > the disk (in other words, an equivalent to lseek(SEEK_HOLE/SEEK_DATA))? > exactly! static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s) ... ret = bdrv_get_block_status_above(source, NULL, sector_num, <------- query block state nb_sectors, &pnum, &file); if (ret < 0 || pnum < nb_sectors || (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) { bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors, mirror_read_complete, op); } else if (ret & BDRV_BLOCK_ZERO) { bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors, <------ skip read op if allowed s->unmap ? BDRV_REQ_MAY_UNMAP : 0, mirror_write_complete, op); } else { assert(!(ret & BDRV_BLOCK_DATA)); bdrv_aio_discard(s->target, sector_num, op->nb_sectors, mirror_write_complete, op); } return delay_ns; Actually I have tried early at day begins to add .bdrv_co_write_zeroes callback to NBD and it just works as expected. The problem is that callback can not be written using NDB_SEND_TRIM to conform with the NBD spec. But in QEMU -> QEMU communication it just works. http://lists.nongnu.org/archive/html/qemu-devel/2016-02/msg03810.html Den
On 17 Feb 2016, at 18:10, Denis V. Lunev <den@openvz.org> wrote: > Currently available NBD_CMD_TRIM command can not be used as the > specification explicitely says that "a client MUST NOT make any > assumptions about the contents of the export affected by this > [NBD_CMD_TRIM] command, until overwriting it again with `NBD_CMD_WRITE`" Would a flag to NBD_CMD_TRIM that says "ensure the written data is zeroed" not be an easier solution than adding another very similar command? Or (cough) changing the spec?
On 02/18/2016 07:46 AM, Denis V. Lunev wrote: > On 02/17/2016 11:58 PM, Eric Blake wrote: >> On 02/17/2016 11:10 AM, Denis V. Lunev wrote: >>> This patch proposes a new command to reduce the amount of data passed >>> through the wire when it is known that the data is all zeroes. This >>> functionality is generally useful for mirroring or backup operations. >>> >>> Currently available NBD_CMD_TRIM command can not be used as the >>> specification explicitely says that "a client MUST NOT make any >> s/explicitely/explicitly/ >> >>> assumptions about the contents of the export affected by this >>> [NBD_CMD_TRIM] command, until overwriting it again with >>> `NBD_CMD_WRITE`" >>> >>> Particular use case could be the following: >>> >>> QEMU project uses own implementation of NBD server to transfer data >>> in between different instances of QEMU. Typically we tranfer VM virtual >> s/tranfer/transfer/ >> >>> disks over this channel. VM virtual disks are sparse and thus the >>> efficiency of backup and mirroring operations could be improved a lot. >>> >>> Signed-off-by: Denis V. Lunev <den@openvz.org> >>> --- >>> doc/proto.md | 7 +++++++ >>> 1 file changed, 7 insertions(+) >>> >>> diff --git a/doc/proto.md b/doc/proto.md >>> index 43065b7..c94751a 100644 >>> --- a/doc/proto.md >>> +++ b/doc/proto.md >>> @@ -241,6 +241,8 @@ immediately after the global flags field in >>> oldstyle negotiation: >>> schedule I/O accesses as for a rotational medium >>> - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server >>> supports >>> `NBD_CMD_TRIM` commands >>> +- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the >>> server >>> + supports `NBD_CMD_WRITE_ZEROES` commands >>> ##### Client flags >>> @@ -446,6 +448,11 @@ The following request types exist: >>> about the contents of the export affected by this command, until >>> overwriting it again with `NBD_CMD_WRITE`. >>> +* `NBD_CMD_WRITE_ZEROES` (6) >>> + >>> + A request to write zeroes. The command is functional equivalent of >>> + the NBD_WRITE_COMMAND but without payload sent through the >>> channel. >> This lets us push holes during writes. > from my point this allows client to apply his policy. For QCOW2 output > target the s/client/server/ Sorry, have mistyped.
On 02/18/2016 11:09 AM, Alex Bligh wrote: > On 17 Feb 2016, at 18:10, Denis V. Lunev <den@openvz.org> wrote: > >> Currently available NBD_CMD_TRIM command can not be used as the >> specification explicitely says that "a client MUST NOT make any >> assumptions about the contents of the export affected by this >> [NBD_CMD_TRIM] command, until overwriting it again with `NBD_CMD_WRITE`" > Would a flag to NBD_CMD_TRIM that says "ensure the written > data is zeroed" not be an easier solution than adding another > very similar command? > > Or (cough) changing the spec? > from the point of the receiver the situation (from my POW) could be different. Let us assume that we are writing to the plain file. There are 2 type of queries: - pls make the target sparse, i.e. perform FALLOC_FL_PUNCH_HOLE and there is no problem that the operation could not be performed, this is a hint; - pls write the following amount of zeroes in either way (even calling write directly), i.e. ensure that the data is zeroed and the space on the file system is allocated for that. The difference comes from the situation when sparse files results in a serious performance impact if written by small pieces. In this case the data of one block of the virtual disk can be placed to very different blocks of the physical disk seriously impacting subsequent sequential reading. Thus I propose to make the situation distinct for these commands. Though you proposal to add NBD_FLAG_SEND_TRIM_ZEROED to export flags is also great. Though this IMHO is a different story. Den
On Wed, Feb 17, 2016 at 01:58:47PM -0700, Eric Blake wrote: > On 02/17/2016 11:10 AM, Denis V. Lunev wrote: > > @@ -446,6 +448,11 @@ The following request types exist: > > about the contents of the export affected by this command, until > > overwriting it again with `NBD_CMD_WRITE`. > > > > +* `NBD_CMD_WRITE_ZEROES` (6) > > + > > + A request to write zeroes. The command is functional equivalent of > > + the NBD_WRITE_COMMAND but without payload sent through the channel. > > This lets us push holes during writes. Do we have the converse > operation, that is, an easy way to query if a block of data will read as > all zeroes, and therefore the client can bypass reading that portion of > the disk (in other words, an equivalent to lseek(SEEK_HOLE/SEEK_DATA))? The spec doesn't have anything like that. OTOH, unlike the write case, where you have all the information and just choose whether to send normal write or zero write, the extra round-trip of a separate SEEK_HOLE/SEEK_DATA request may lead to actually degrading the overall throughput. Rather it may be a better idea to add something like sparse read where the server would, instead of sending the full length of data in the response payload, send a smarter variable-length package with a scatter-gather list or a bitmap of used blocks in the beginning, and let the client decode it and fill the gaps with zeros. Roman.
On 02/18/2016 12:18 PM, Roman Kagan wrote: > On Wed, Feb 17, 2016 at 01:58:47PM -0700, Eric Blake wrote: >> On 02/17/2016 11:10 AM, Denis V. Lunev wrote: >>> @@ -446,6 +448,11 @@ The following request types exist: >>> about the contents of the export affected by this command, until >>> overwriting it again with `NBD_CMD_WRITE`. >>> >>> +* `NBD_CMD_WRITE_ZEROES` (6) >>> + >>> + A request to write zeroes. The command is functional equivalent of >>> + the NBD_WRITE_COMMAND but without payload sent through the channel. >> This lets us push holes during writes. Do we have the converse >> operation, that is, an easy way to query if a block of data will read as >> all zeroes, and therefore the client can bypass reading that portion of >> the disk (in other words, an equivalent to lseek(SEEK_HOLE/SEEK_DATA))? > The spec doesn't have anything like that. > > OTOH, unlike the write case, where you have all the information and just > choose whether to send normal write or zero write, the extra round-trip > of a separate SEEK_HOLE/SEEK_DATA request may lead to actually degrading > the overall throughput. > > Rather it may be a better idea to add something like sparse read where > the server would, instead of sending the full length of data in the > response payload, send a smarter variable-length package with a > scatter-gather list or a bitmap of used blocks in the beginning, and let > the client decode it and fill the gaps with zeros. > > Roman. ah, I see. This story is more difficult but also viable for backup dirty bitmap reading. But this will make the protocol more complex and will require more efforts at specification stage. I'd better start with the current change, which is simple enough and make changes in a right direction and after that continue with READ2 or whatever command. Den
On Wed, Feb 17, 2016 at 01:58:47PM -0700, Eric Blake wrote: > On 02/17/2016 11:10 AM, Denis V. Lunev wrote: > > This patch proposes a new command to reduce the amount of data passed > > through the wire when it is known that the data is all zeroes. This > > functionality is generally useful for mirroring or backup operations. > > > > Currently available NBD_CMD_TRIM command can not be used as the > > specification explicitely says that "a client MUST NOT make any > > s/explicitely/explicitly/ > > > assumptions about the contents of the export affected by this > > [NBD_CMD_TRIM] command, until overwriting it again with `NBD_CMD_WRITE`" > > > > Particular use case could be the following: > > > > QEMU project uses own implementation of NBD server to transfer data > > in between different instances of QEMU. Typically we tranfer VM virtual > > s/tranfer/transfer/ > > > disks over this channel. VM virtual disks are sparse and thus the > > efficiency of backup and mirroring operations could be improved a lot. > > > > Signed-off-by: Denis V. Lunev <den@openvz.org> > > --- > > doc/proto.md | 7 +++++++ > > 1 file changed, 7 insertions(+) > > > > diff --git a/doc/proto.md b/doc/proto.md > > index 43065b7..c94751a 100644 > > --- a/doc/proto.md > > +++ b/doc/proto.md > > @@ -241,6 +241,8 @@ immediately after the global flags field in oldstyle negotiation: > > schedule I/O accesses as for a rotational medium > > - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports > > `NBD_CMD_TRIM` commands > > +- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server > > + supports `NBD_CMD_WRITE_ZEROES` commands > > > > ##### Client flags > > > > @@ -446,6 +448,11 @@ The following request types exist: > > about the contents of the export affected by this command, until > > overwriting it again with `NBD_CMD_WRITE`. > > > > +* `NBD_CMD_WRITE_ZEROES` (6) > > + > > + A request to write zeroes. The command is functional equivalent of > > + the NBD_WRITE_COMMAND but without payload sent through the channel. > > This lets us push holes during writes. Do we have the converse > operation, that is, an easy way to query if a block of data will read as > all zeroes, and therefore the client can bypass reading that portion of > the disk (in other words, an equivalent to lseek(SEEK_HOLE/SEEK_DATA))? Stefan has suggested that we add a command to the NBD spec that implements the SCSI Get LBA Status command. This lets clients query the allocation bitmap for the device, which would serve this purpose. https://lists.gnu.org/archive/html/qemu-devel/2016-02/msg03582.html In that thread he talks about it being a way to serve up the dirty bitmap for live backup scenario, but in regular usage it obviously provides the normal allocation bitmap Regards, Daniel
On 02/18/2016 03:14 PM, Daniel P. Berrange wrote: > On Wed, Feb 17, 2016 at 01:58:47PM -0700, Eric Blake wrote: >> On 02/17/2016 11:10 AM, Denis V. Lunev wrote: >>> This patch proposes a new command to reduce the amount of data passed >>> through the wire when it is known that the data is all zeroes. This >>> functionality is generally useful for mirroring or backup operations. >>> >>> Currently available NBD_CMD_TRIM command can not be used as the >>> specification explicitely says that "a client MUST NOT make any >> s/explicitely/explicitly/ >> >>> assumptions about the contents of the export affected by this >>> [NBD_CMD_TRIM] command, until overwriting it again with `NBD_CMD_WRITE`" >>> >>> Particular use case could be the following: >>> >>> QEMU project uses own implementation of NBD server to transfer data >>> in between different instances of QEMU. Typically we tranfer VM virtual >> s/tranfer/transfer/ >> >>> disks over this channel. VM virtual disks are sparse and thus the >>> efficiency of backup and mirroring operations could be improved a lot. >>> >>> Signed-off-by: Denis V. Lunev <den@openvz.org> >>> --- >>> doc/proto.md | 7 +++++++ >>> 1 file changed, 7 insertions(+) >>> >>> diff --git a/doc/proto.md b/doc/proto.md >>> index 43065b7..c94751a 100644 >>> --- a/doc/proto.md >>> +++ b/doc/proto.md >>> @@ -241,6 +241,8 @@ immediately after the global flags field in oldstyle negotiation: >>> schedule I/O accesses as for a rotational medium >>> - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports >>> `NBD_CMD_TRIM` commands >>> +- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server >>> + supports `NBD_CMD_WRITE_ZEROES` commands >>> >>> ##### Client flags >>> >>> @@ -446,6 +448,11 @@ The following request types exist: >>> about the contents of the export affected by this command, until >>> overwriting it again with `NBD_CMD_WRITE`. >>> >>> +* `NBD_CMD_WRITE_ZEROES` (6) >>> + >>> + A request to write zeroes. The command is functional equivalent of >>> + the NBD_WRITE_COMMAND but without payload sent through the channel. >> This lets us push holes during writes. Do we have the converse >> operation, that is, an easy way to query if a block of data will read as >> all zeroes, and therefore the client can bypass reading that portion of >> the disk (in other words, an equivalent to lseek(SEEK_HOLE/SEEK_DATA))? > Stefan has suggested that we add a command to the NBD spec that > implements the SCSI Get LBA Status command. This lets clients > query the allocation bitmap for the device, which would serve > this purpose. > > https://lists.gnu.org/archive/html/qemu-devel/2016-02/msg03582.html > > In that thread he talks about it being a way to serve up the dirty > bitmap for live backup scenario, but in regular usage it obviously > provides the normal allocation bitmap > > > Regards, > Daniel But in this case we should allow to query the information for more than one block at once and also we will have to make an agreement in between the client and server about the granularity of the request or specify the granularity along as the range in the call. Den
On 02/18/2016 02:18 AM, Roman Kagan wrote: > On Wed, Feb 17, 2016 at 01:58:47PM -0700, Eric Blake wrote: >> On 02/17/2016 11:10 AM, Denis V. Lunev wrote: >>> @@ -446,6 +448,11 @@ The following request types exist: >>> about the contents of the export affected by this command, until >>> overwriting it again with `NBD_CMD_WRITE`. >>> >>> +* `NBD_CMD_WRITE_ZEROES` (6) >>> + >>> + A request to write zeroes. The command is functional equivalent of >>> + the NBD_WRITE_COMMAND but without payload sent through the channel. >> >> This lets us push holes during writes. Do we have the converse >> operation, that is, an easy way to query if a block of data will read as >> all zeroes, and therefore the client can bypass reading that portion of >> the disk (in other words, an equivalent to lseek(SEEK_HOLE/SEEK_DATA))? > > The spec doesn't have anything like that. > > OTOH, unlike the write case, where you have all the information and just > choose whether to send normal write or zero write, the extra round-trip > of a separate SEEK_HOLE/SEEK_DATA request may lead to actually degrading > the overall throughput. > > Rather it may be a better idea to add something like sparse read where > the server would, instead of sending the full length of data in the > response payload, send a smarter variable-length package with a > scatter-gather list or a bitmap of used blocks in the beginning, and let > the client decode it and fill the gaps with zeros. Sure, that would work too, and sounds nicer. Either way, the point is that we should strongly consider improving the NBD protocol to allow more efficient handling of sparse files, in both the push and in the pull direction. Qemu already has a desire to use both directions of improvements, but there are more programs, both clients and servers, outside of qemu, that could benefit from such protocol improvements.
On 02/18/2016 07:35 PM, Eric Blake wrote: > On 02/18/2016 02:18 AM, Roman Kagan wrote: >> On Wed, Feb 17, 2016 at 01:58:47PM -0700, Eric Blake wrote: >>> On 02/17/2016 11:10 AM, Denis V. Lunev wrote: >>>> @@ -446,6 +448,11 @@ The following request types exist: >>>> about the contents of the export affected by this command, until >>>> overwriting it again with `NBD_CMD_WRITE`. >>>> >>>> +* `NBD_CMD_WRITE_ZEROES` (6) >>>> + >>>> + A request to write zeroes. The command is functional equivalent of >>>> + the NBD_WRITE_COMMAND but without payload sent through the channel. >>> This lets us push holes during writes. Do we have the converse >>> operation, that is, an easy way to query if a block of data will read as >>> all zeroes, and therefore the client can bypass reading that portion of >>> the disk (in other words, an equivalent to lseek(SEEK_HOLE/SEEK_DATA))? >> The spec doesn't have anything like that. >> >> OTOH, unlike the write case, where you have all the information and just >> choose whether to send normal write or zero write, the extra round-trip >> of a separate SEEK_HOLE/SEEK_DATA request may lead to actually degrading >> the overall throughput. >> >> Rather it may be a better idea to add something like sparse read where >> the server would, instead of sending the full length of data in the >> response payload, send a smarter variable-length package with a >> scatter-gather list or a bitmap of used blocks in the beginning, and let >> the client decode it and fill the gaps with zeros. > Sure, that would work too, and sounds nicer. Either way, the point is > that we should strongly consider improving the NBD protocol to allow > more efficient handling of sparse files, in both the push and in the > pull direction. Qemu already has a desire to use both directions of > improvements, but there are more programs, both clients and servers, > outside of qemu, that could benefit from such protocol improvements. > OK Here is a short summary of features which seems necessary from QEMU point of view: - ability to avoid sending zeroes during write operation. The proposal comes in the thread-starter letter - ability to request block status (allocate/not allocated) from server. This seems interesting to preserve "sparseness" of the transferring data - ability to skip zeroes during read operation, i.e. something like READ2 command which will return vector of chunks as a reply All 3 features seem usable for generic NBD use-cases and not only for QEMU. If there are no objections I'll sum this up and come with a specification draft. Den P.S. I have added here all parties which have participated in conversation in different threads on QEMU side.
On 02/18/2016 10:23 AM, Denis V. Lunev wrote: > > Here is a short summary of features which seems necessary from QEMU > point of > view: > - ability to avoid sending zeroes during write operation. The proposal > comes in > the thread-starter letter > - ability to request block status (allocate/not allocated) from server. > This seems > interesting to preserve "sparseness" of the transferring data > - ability to skip zeroes during read operation, i.e. something like > READ2 command > which will return vector of chunks as a reply > > All 3 features seem usable for generic NBD use-cases and not only for QEMU. All three features must be negotiated as part of connection handshake. And we want to ensure sane fallbacks: Client - if the server does not support the features, we fall back to writing explicit zeroes (and give up on sparseness), and to assuming the entire image is non-sparse (can't query or read sparseness). Server - if client requests write 0, optimize where underlying storage allows it, but we can always fall back to explicitly writing 0s and merely treating the protocol as a compression of what is sent over the wire. If client requests block status, but underlying storage doesn't provide it, we can always fall back to claiming the entire image is allocated. If client requests RAED2 but underlying storage has no way to detect holes, we can always fall back to sending a single vector covering the entire read request (no compression). > > If there are no objections I'll sum this up and come with a > specification draft. Good luck! I'm sure you'll get good reviews. > > Den > > P.S. I have added here all parties which have participated in > conversation in > different threads on QEMU side. >
On 18 Feb 2016, at 17:23, Denis V. Lunev <den@openvz.org> wrote: > - ability to skip zeroes during read operation, i.e. something like > READ2 command > which will return vector of chunks as a reply ... > If there are no objections I'll sum this up and come with a > specification draft. If you are fixing READ2 to allow a vector based reply, please also consider allowing an error to be part of that vector. The protocol currently has an issue where it is non-obvious how to return an error encountered midway through a long read where some data has already been sent to the client (from memory i.e. unless it's already been fixed).
On 02/18/2016 08:23 PM, Denis V. Lunev wrote: > On 02/18/2016 07:35 PM, Eric Blake wrote: >> On 02/18/2016 02:18 AM, Roman Kagan wrote: >>> On Wed, Feb 17, 2016 at 01:58:47PM -0700, Eric Blake wrote: >>>> On 02/17/2016 11:10 AM, Denis V. Lunev wrote: >>>>> @@ -446,6 +448,11 @@ The following request types exist: >>>>> about the contents of the export affected by this command, >>>>> until >>>>> overwriting it again with `NBD_CMD_WRITE`. >>>>> +* `NBD_CMD_WRITE_ZEROES` (6) >>>>> + >>>>> + A request to write zeroes. The command is functional >>>>> equivalent of >>>>> + the NBD_WRITE_COMMAND but without payload sent through the >>>>> channel. >>>> This lets us push holes during writes. Do we have the converse >>>> operation, that is, an easy way to query if a block of data will >>>> read as >>>> all zeroes, and therefore the client can bypass reading that >>>> portion of >>>> the disk (in other words, an equivalent to >>>> lseek(SEEK_HOLE/SEEK_DATA))? >>> The spec doesn't have anything like that. >>> >>> OTOH, unlike the write case, where you have all the information and >>> just >>> choose whether to send normal write or zero write, the extra round-trip >>> of a separate SEEK_HOLE/SEEK_DATA request may lead to actually >>> degrading >>> the overall throughput. >>> >>> Rather it may be a better idea to add something like sparse read where >>> the server would, instead of sending the full length of data in the >>> response payload, send a smarter variable-length package with a >>> scatter-gather list or a bitmap of used blocks in the beginning, and >>> let >>> the client decode it and fill the gaps with zeros. >> Sure, that would work too, and sounds nicer. Either way, the point is >> that we should strongly consider improving the NBD protocol to allow >> more efficient handling of sparse files, in both the push and in the >> pull direction. Qemu already has a desire to use both directions of >> improvements, but there are more programs, both clients and servers, >> outside of qemu, that could benefit from such protocol improvements. >> > OK > > Here is a short summary of features which seems necessary from QEMU > point of > view: > - ability to avoid sending zeroes during write operation. The proposal > comes in > the thread-starter letter > - ability to request block status (allocate/not allocated) from > server. This seems > interesting to preserve "sparseness" of the transferring data > - ability to skip zeroes during read operation, i.e. something like > READ2 command > which will return vector of chunks as a reply > > All 3 features seem usable for generic NBD use-cases and not only for > QEMU. > > If there are no objections I'll sum this up and come with a > specification draft. > > Den > > P.S. I have added here all parties which have participated in > conversation in > different threads on QEMU side. interesting point from a verbal discussion with one of my friends. Protocol level compression could eliminate the necessity to think about zeroes in channel either from read or from write point of views and will also reduce the amount of data to transfer. Den
On 19.02.2016 10:12, Denis V. Lunev wrote: > On 02/18/2016 08:23 PM, Denis V. Lunev wrote: >> On 02/18/2016 07:35 PM, Eric Blake wrote: >>> On 02/18/2016 02:18 AM, Roman Kagan wrote: >>>> On Wed, Feb 17, 2016 at 01:58:47PM -0700, Eric Blake wrote: >>>>> On 02/17/2016 11:10 AM, Denis V. Lunev wrote: >>>>>> @@ -446,6 +448,11 @@ The following request types exist: >>>>>> about the contents of the export affected by this command, >>>>>> until >>>>>> overwriting it again with `NBD_CMD_WRITE`. >>>>>> +* `NBD_CMD_WRITE_ZEROES` (6) >>>>>> + >>>>>> + A request to write zeroes. The command is functional >>>>>> equivalent of >>>>>> + the NBD_WRITE_COMMAND but without payload sent through the >>>>>> channel. >>>>> This lets us push holes during writes. Do we have the converse >>>>> operation, that is, an easy way to query if a block of data will >>>>> read as >>>>> all zeroes, and therefore the client can bypass reading that >>>>> portion of >>>>> the disk (in other words, an equivalent to >>>>> lseek(SEEK_HOLE/SEEK_DATA))? >>>> The spec doesn't have anything like that. >>>> >>>> OTOH, unlike the write case, where you have all the information and >>>> just >>>> choose whether to send normal write or zero write, the extra >>>> round-trip >>>> of a separate SEEK_HOLE/SEEK_DATA request may lead to actually >>>> degrading >>>> the overall throughput. >>>> >>>> Rather it may be a better idea to add something like sparse read where >>>> the server would, instead of sending the full length of data in the >>>> response payload, send a smarter variable-length package with a >>>> scatter-gather list or a bitmap of used blocks in the beginning, >>>> and let >>>> the client decode it and fill the gaps with zeros. >>> Sure, that would work too, and sounds nicer. Either way, the point is >>> that we should strongly consider improving the NBD protocol to allow >>> more efficient handling of sparse files, in both the push and in the >>> pull direction. Qemu already has a desire to use both directions of >>> improvements, but there are more programs, both clients and servers, >>> outside of qemu, that could benefit from such protocol improvements. >>> >> OK >> >> Here is a short summary of features which seems necessary from QEMU >> point of >> view: >> - ability to avoid sending zeroes during write operation. The >> proposal comes in >> the thread-starter letter >> - ability to request block status (allocate/not allocated) from >> server. This seems >> interesting to preserve "sparseness" of the transferring data >> - ability to skip zeroes during read operation, i.e. something like >> READ2 command >> which will return vector of chunks as a reply >> >> All 3 features seem usable for generic NBD use-cases and not only for >> QEMU. >> >> If there are no objections I'll sum this up and come with a >> specification draft. >> >> Den >> >> P.S. I have added here all parties which have participated in >> conversation in >> different threads on QEMU side. > > interesting point from a verbal discussion with one of my friends. > Protocol level compression could eliminate the necessity to > think about zeroes in channel either from read or from write > point of views and will also reduce the amount of data to > transfer. > > Den Compression is worse than separate commands, because after decompression we will have to write or somehow test these zeroes again.
On Fri, Feb 19, 2016 at 10:12:09AM +0300, Denis V. Lunev wrote: > On 02/18/2016 08:23 PM, Denis V. Lunev wrote: > >On 02/18/2016 07:35 PM, Eric Blake wrote: > >>On 02/18/2016 02:18 AM, Roman Kagan wrote: > >>>On Wed, Feb 17, 2016 at 01:58:47PM -0700, Eric Blake wrote: > >>>>On 02/17/2016 11:10 AM, Denis V. Lunev wrote: > >>>>>@@ -446,6 +448,11 @@ The following request types exist: > >>>>> about the contents of the export affected by this command, > >>>>>until > >>>>> overwriting it again with `NBD_CMD_WRITE`. > >>>>> +* `NBD_CMD_WRITE_ZEROES` (6) > >>>>>+ > >>>>>+ A request to write zeroes. The command is functional > >>>>>equivalent of > >>>>>+ the NBD_WRITE_COMMAND but without payload sent through the > >>>>>channel. > >>>>This lets us push holes during writes. Do we have the converse > >>>>operation, that is, an easy way to query if a block of data will > >>>>read as > >>>>all zeroes, and therefore the client can bypass reading that portion > >>>>of > >>>>the disk (in other words, an equivalent to > >>>>lseek(SEEK_HOLE/SEEK_DATA))? > >>>The spec doesn't have anything like that. > >>> > >>>OTOH, unlike the write case, where you have all the information and > >>>just > >>>choose whether to send normal write or zero write, the extra round-trip > >>>of a separate SEEK_HOLE/SEEK_DATA request may lead to actually > >>>degrading > >>>the overall throughput. > >>> > >>>Rather it may be a better idea to add something like sparse read where > >>>the server would, instead of sending the full length of data in the > >>>response payload, send a smarter variable-length package with a > >>>scatter-gather list or a bitmap of used blocks in the beginning, and > >>>let > >>>the client decode it and fill the gaps with zeros. > >>Sure, that would work too, and sounds nicer. Either way, the point is > >>that we should strongly consider improving the NBD protocol to allow > >>more efficient handling of sparse files, in both the push and in the > >>pull direction. Qemu already has a desire to use both directions of > >>improvements, but there are more programs, both clients and servers, > >>outside of qemu, that could benefit from such protocol improvements. > >> > >OK > > > >Here is a short summary of features which seems necessary from QEMU point > >of > >view: > >- ability to avoid sending zeroes during write operation. The proposal > >comes in > > the thread-starter letter > >- ability to request block status (allocate/not allocated) from server. > >This seems > > interesting to preserve "sparseness" of the transferring data > >- ability to skip zeroes during read operation, i.e. something like READ2 > >command > > which will return vector of chunks as a reply > > > >All 3 features seem usable for generic NBD use-cases and not only for > >QEMU. > > > >If there are no objections I'll sum this up and come with a specification > >draft. > > > >Den > > > >P.S. I have added here all parties which have participated in conversation > >in > > different threads on QEMU side. > > interesting point from a verbal discussion with one of my friends. > Protocol level compression could eliminate the necessity to > think about zeroes in channel either from read or from write > point of views and will also reduce the amount of data to > transfer. With compression you have thrown away information about sparseness which you really want to have when writing out the file on the other end. It forces you to do memcmp detection of zero regions after decompression which is CPU intensive. Compression is a fine as a concept, but it is not a replacement for handling sparseness directly in the protocol. Regards, Daniel
Hi folks, (sorry about the lateness of this reply, was busy for the last few weeks) On Thu, Feb 18, 2016 at 11:34:04AM +0300, Denis V. Lunev wrote: > On 02/18/2016 11:09 AM, Alex Bligh wrote: > > On 17 Feb 2016, at 18:10, Denis V. Lunev <den@openvz.org> wrote: > > > >> Currently available NBD_CMD_TRIM command can not be used as the > >> specification explicitely says that "a client MUST NOT make any > >> assumptions about the contents of the export affected by this > >> [NBD_CMD_TRIM] command, until overwriting it again with `NBD_CMD_WRITE`" > > Would a flag to NBD_CMD_TRIM that says "ensure the written > > data is zeroed" not be an easier solution than adding another > > very similar command? > > > > Or (cough) changing the spec? > > > from the point of the receiver the situation (from my POW) could > be different. Let us assume that we are writing to the plain > file. > > There are 2 type of queries: > - pls make the target sparse, i.e. perform FALLOC_FL_PUNCH_HOLE > and there is no problem that the operation could not be performed, > this is a hint; This is what NBD_CMD_TRIM does, currently. The reason this is a hint, is that there is no guarantee that the underlying operating system or storage even supports FALLOC_FL_PUNCH_HOLE (or similar). We could have made NBD_CMD_TRIM fail with a "not possible on this export" kind of error in that case, but it was chosen not to do that (for reasons I don't remember; maybe we just didn't consider this enough). This could be remedied if the client could somehow ask what the result of a TRIM command would be; i.e., if the server has support for FALLOC_FL_PUNCH_HOLE, it could set a flag which would let the client know that NBD_CMD_TRIM will zero out bytes. If the server doesn't set that flag and the client requires zeroes, it could then just issue a WRITE command, followed (maybe) by a TRIM for the same region (which would be less optimal, but have the same result with older servers) > - pls write the following amount of zeroes in either way (even calling > write directly), i.e. ensure that the data is zeroed and the space on > the file system is allocated for that. IOW, you *don't* want to have a sparse file in that case? Or do I misunderstand things here? I'm not entirely sure I understand the problem here...
Am 04.03.2016 um 09:49 hat Wouter Verhelst geschrieben: > Hi folks, > > (sorry about the lateness of this reply, was busy for the last few weeks) > > On Thu, Feb 18, 2016 at 11:34:04AM +0300, Denis V. Lunev wrote: > > On 02/18/2016 11:09 AM, Alex Bligh wrote: > > > On 17 Feb 2016, at 18:10, Denis V. Lunev <den@openvz.org> wrote: > > > > > >> Currently available NBD_CMD_TRIM command can not be used as the > > >> specification explicitely says that "a client MUST NOT make any > > >> assumptions about the contents of the export affected by this > > >> [NBD_CMD_TRIM] command, until overwriting it again with `NBD_CMD_WRITE`" > > > Would a flag to NBD_CMD_TRIM that says "ensure the written > > > data is zeroed" not be an easier solution than adding another > > > very similar command? > > > > > > Or (cough) changing the spec? > > > > > from the point of the receiver the situation (from my POW) could > > be different. Let us assume that we are writing to the plain > > file. > > > > There are 2 type of queries: > > - pls make the target sparse, i.e. perform FALLOC_FL_PUNCH_HOLE > > and there is no problem that the operation could not be performed, > > this is a hint; > > This is what NBD_CMD_TRIM does, currently. > > The reason this is a hint, is that there is no guarantee that the > underlying operating system or storage even supports > FALLOC_FL_PUNCH_HOLE (or similar). We could have made NBD_CMD_TRIM fail > with a "not possible on this export" kind of error in that case, but it > was chosen not to do that (for reasons I don't remember; maybe we just > didn't consider this enough). > > This could be remedied if the client could somehow ask what the result > of a TRIM command would be; i.e., if the server has support for > FALLOC_FL_PUNCH_HOLE, it could set a flag which would let the client > know that NBD_CMD_TRIM will zero out bytes. If the server doesn't set > that flag and the client requires zeroes, it could then just issue a > WRITE command, followed (maybe) by a TRIM for the same region (which > would be less optimal, but have the same result with older servers) NBD_CMD_TRIM covers the case "I don't need this data any more, you can throw it away", and I think treating that purely as a hint is perfectly fine. > > - pls write the following amount of zeroes in either way (even calling > > write directly), i.e. ensure that the data is zeroed and the space on > > the file system is allocated for that. > > IOW, you *don't* want to have a sparse file in that case? Or do I > misunderstand things here? I think what we're looking for is more like "zero out this area, feel free to use whatever method is most efficient to achieve that". So if the server knows that the backing store supports an efficient way to write zeros (e.g. FALLOC_FL_ZERO_RANGE), it will use that. Otherwise, if TRIM works and we know that the result is zeroed space instead of undefined contents, the server is free to use it. And if even that fails, it just falls back to an explicit write of a zeroed buffer. If we want, we can give the client a little more control about whether or not discarding in the process is allowed (or maybe even preferred). qemu's interface for writing zeros has a BDRV_REQ_MAY_UNMAP flag, for example. Kevin
On 04/03/2016 10:54, Kevin Wolf wrote: >>> - pls write the following amount of zeroes in either way (even calling >>> > > write directly), i.e. ensure that the data is zeroed and the space on >>> > > the file system is allocated for that. >> > >> > IOW, you *don't* want to have a sparse file in that case? Or do I >> > misunderstand things here? > I think what we're looking for is more like "zero out this area, feel > free to use whatever method is most efficient to achieve that". > > So if the server knows that the backing store supports an efficient way > to write zeros (e.g. FALLOC_FL_ZERO_RANGE), it will use that. Otherwise, > if TRIM works and we know that the result is zeroed space instead of > undefined contents, the server is free to use it. And if even that > fails, it just falls back to an explicit write of a zeroed buffer. > > If we want, we can give the client a little more control about whether > or not discarding in the process is allowed (or maybe even preferred). > qemu's interface for writing zeros has a BDRV_REQ_MAY_UNMAP flag, for > example. NBD-wise, I think the TRIM command is good as it is, and NBD_CMD_WRITE_ZEROES should be added like Den is doing. It also makes sense to use trimming to implement NBD_CMD_WRITE_ZEROES, but it should be explicitly requested by the user. For this, my suggestion is that NBD_CMD_WRITE_ZEROES should have an NBD_FLAG_TRY_TRIM flag in bit 16. If specified, the backend can use a zero-writing mechanism that trims, _but_ it must ensure that the bytes read as zero. If it cannot ensure that, it must not trim and it should instead do a full write. This is similar to the SCSI command WRITE SAME (when the command payload is all zeroes). Like Kevin said, it also happens to map nicely to the QEMU block device layer. Paolo
Hi all, On Fri, Mar 04, 2016 at 03:03:26PM +0100, Paolo Bonzini wrote: > NBD-wise, I think the TRIM command is good as it is, and > NBD_CMD_WRITE_ZEROES should be added like Den is doing. > > It also makes sense to use trimming to implement NBD_CMD_WRITE_ZEROES, > but it should be explicitly requested by the user. For this, my > suggestion is that NBD_CMD_WRITE_ZEROES should have an > NBD_FLAG_TRY_TRIM flag in bit 16. If specified, the backend can use a > zero-writing mechanism that trims, _but_ it must ensure that the bytes > read as zero. If it cannot ensure that, it must not trim and it > should instead do a full write. This is similar to the SCSI command > WRITE SAME (when the command payload is all zeroes). Like Kevin said, > it also happens to map nicely to the QEMU block device layer. That seems like a sensible approach, yes. Would one of you who feels strongly about this be willing to write up a proposed spec for that? I could do it, but since I'm not 100% sure I understand all the specific requirements, I'm uncomfortable doing so. If that doesn't raise any obvious issues that I'm aware of, I'd be happy to add it to the "experimental" section of the proto.md file. (speaking of which, I notice that the STARTTLS patches got merged into qemu, so I've moved the description of that to the main body and out of the "experimental" part of that document)
On 03/06/2016 01:28 PM, Wouter Verhelst wrote: > Hi all, > > On Fri, Mar 04, 2016 at 03:03:26PM +0100, Paolo Bonzini wrote: >> NBD-wise, I think the TRIM command is good as it is, and >> NBD_CMD_WRITE_ZEROES should be added like Den is doing. >> >> It also makes sense to use trimming to implement NBD_CMD_WRITE_ZEROES, >> but it should be explicitly requested by the user. For this, my >> suggestion is that NBD_CMD_WRITE_ZEROES should have an >> NBD_FLAG_TRY_TRIM flag in bit 16. If specified, the backend can use a >> zero-writing mechanism that trims, _but_ it must ensure that the bytes >> read as zero. If it cannot ensure that, it must not trim and it >> should instead do a full write. This is similar to the SCSI command >> WRITE SAME (when the command payload is all zeroes). Like Kevin said, >> it also happens to map nicely to the QEMU block device layer. > That seems like a sensible approach, yes. > > Would one of you who feels strongly about this be willing to write up a > proposed spec for that? I could do it, but since I'm not 100% sure I > understand all the specific requirements, I'm uncomfortable doing so. > > If that doesn't raise any obvious issues that I'm aware of, I'd be happy > to add it to the "experimental" section of the proto.md file. > > (speaking of which, I notice that the STARTTLS patches got merged into > qemu, so I've moved the description of that to the main body and out of > the "experimental" part of that document) > I am going to send next RFC next Wednesday. Just waited other opinions not from QEMU side. Den
diff --git a/doc/proto.md b/doc/proto.md index 43065b7..c94751a 100644 --- a/doc/proto.md +++ b/doc/proto.md @@ -241,6 +241,8 @@ immediately after the global flags field in oldstyle negotiation: schedule I/O accesses as for a rotational medium - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports `NBD_CMD_TRIM` commands +- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server + supports `NBD_CMD_WRITE_ZEROES` commands ##### Client flags @@ -446,6 +448,11 @@ The following request types exist: about the contents of the export affected by this command, until overwriting it again with `NBD_CMD_WRITE`. +* `NBD_CMD_WRITE_ZEROES` (6) + + A request to write zeroes. The command is functional equivalent of + the NBD_WRITE_COMMAND but without payload sent through the channel. + * Other requests Some third-party implementations may require additional protocol
This patch proposes a new command to reduce the amount of data passed through the wire when it is known that the data is all zeroes. This functionality is generally useful for mirroring or backup operations. Currently available NBD_CMD_TRIM command can not be used as the specification explicitely says that "a client MUST NOT make any assumptions about the contents of the export affected by this [NBD_CMD_TRIM] command, until overwriting it again with `NBD_CMD_WRITE`" Particular use case could be the following: QEMU project uses own implementation of NBD server to transfer data in between different instances of QEMU. Typically we tranfer VM virtual disks over this channel. VM virtual disks are sparse and thus the efficiency of backup and mirroring operations could be improved a lot. Signed-off-by: Denis V. Lunev <den@openvz.org> --- doc/proto.md | 7 +++++++ 1 file changed, 7 insertions(+)