Message ID | 1462524302-15558-1-git-send-email-quentin.casasnovas@oracle.com (mailing list archive) |
---|---|
State | New, archived |
[adding nbd-devel, qemu-block]

On 05/06/2016 02:45 AM, Quentin Casasnovas wrote:
> When running fstrim on a filesystem mounted through qemu-nbd with
> --discard=on, fstrim would fail with I/O errors:
>
>   $ fstrim /k/spl/ice/
>   fstrim: /k/spl/ice/: FITRIM ioctl failed: Input/output error
>
> and qemu-nbd was spitting these:
>
>   nbd.c:nbd_co_receive_request():L1232: len (94621696) is larger than max len (33554432)
>
> The length of the request seems huge but this is really just the filesystem
> telling the block device driver that "this length should be trimmed", and,
> unlike for a NBD_CMD_READ or NBD_CMD_WRITE, we'll not try to read/write
> that amount of data from/to the NBD socket. It is thus safe to remove the
> length check for a NBD_CMD_TRIM.
>
> I've confirmed this with both the protocol documentation at:
>
>   https://github.com/yoe/nbd/blob/master/doc/proto.md

Hmm. The current wording of the experimental block size additions does
NOT allow the client to send a NBD_CMD_TRIM with a size larger than the
maximum NBD_CMD_WRITE:
https://github.com/yoe/nbd/blob/extension-info/doc/proto.md#block-size-constraints

Maybe we should revisit that in the spec, and/or advertise yet another
block size (since the maximum size for a trim and/or write_zeroes
request may indeed be different than the maximum size for a read/write).
But since the kernel is the one sending the large length request, and
since you are right that this is not a denial-of-service in the amount
of data being sent in a single NBD message, I definitely agree that qemu
would be wise as a quality-of-implementation to allow the larger size,
for maximum interoperability, even if it exceeds advertised limits (that
is, when no limits are advertised, we should handle everything possible
if it is not so large as to be construed a denial-of-service, and
NBD_CMD_TRIM is not large; and when limits ARE advertised, a client that
violates limits is out of spec but we can still be liberal and respond
successfully to such a client rather than having to outright reject it).
So I think this patch is headed in the right direction.

> and looking at the kernel side implementation of the nbd device
> (drivers/block/nbd.c) where it only sends the request header with no data
> for a NBD_CMD_TRIM.
>
> With this fix in, I am now able to run fstrim on my qcow2 images and keep
> them small (or at least keep their size proportional to the amount of data
> present on them).
>
> Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: <qemu-devel@nongnu.org>
> CC: <qemu-stable@nongnu.org>
> CC: <qemu-trivial@nongnu.org>

This is NOT trivial material and should not go in through that tree.
However, I concur that it qualifies for a backport on a stable branch.
> ---
>  nbd.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/nbd.c b/nbd.c
> index b3d9654..e733669 100644
> --- a/nbd.c
> +++ b/nbd.c
> @@ -1209,6 +1209,11 @@ static ssize_t nbd_co_send_reply(NBDRequest *req, struct nbd_reply *reply,
>      return rc;
>  }
>
> +static bool nbd_should_check_request_size(const struct nbd_request *request)
> +{
> +    return (request->type & NBD_CMD_MASK_COMMAND) != NBD_CMD_TRIM;
> +}
> +
>  static ssize_t nbd_co_receive_request(NBDRequest *req, struct nbd_request *request)
>  {
>      NBDClient *client = req->client;
> @@ -1227,7 +1232,8 @@ static ssize_t nbd_co_receive_request(NBDRequest *req, struct nbd_request *reque
>          goto out;
>      }
>
> -    if (request->len > NBD_MAX_BUFFER_SIZE) {
> +    if (nbd_should_check_request_size(request) &&
> +        request->len > NBD_MAX_BUFFER_SIZE) {

I'd rather sort out the implications of this on the NBD protocol before
taking anything into qemu. We've got time on our hand, so let's use it
to get this right. (That, and I have several pending patches that
conflict with this as part of adding WRITE_ZEROES and INFO_BLOCK_SIZE
support, where it may be easier to resubmit this fix on top of my
pending patches).

>          LOG("len (%u) is larger than max len (%u)",
>              request->len, NBD_MAX_BUFFER_SIZE);
>          rc = -EINVAL;
Eric,

> Hmm. The current wording of the experimental block size additions does
> NOT allow the client to send a NBD_CMD_TRIM with a size larger than the
> maximum NBD_CMD_WRITE:
> https://github.com/yoe/nbd/blob/extension-info/doc/proto.md#block-size-constraints

Correct

> Maybe we should revisit that in the spec, and/or advertise yet another
> block size (since the maximum size for a trim and/or write_zeroes
> request may indeed be different than the maximum size for a read/write).

I think it's up to the server to either handle large requests, or for
the client to break these up.

The core problem here is that the kernel (and, ahem, most servers) are
ignorant of the block size extension, and need to guess how to break
things up. In my view the client (kernel in this case) should be
breaking the trim requests up into whatever size it uses as the maximum
size for write requests. But then it would have to know about block
sizes, which are in (another) experimental extension.

I've nothing against you fixing it in qemu's server (after all I don't
think there is anything to /prohibit/ a server working on something
larger than the maximum block size), and ...

> But since the kernel is the one sending the large length request, and
> since you are right that this is not a denial-of-service in the amount
> of data being sent in a single NBD message,

... currently there isn't actually a maximum block size advertised (that
being in another experimental addition), so this is just DoS prevention,
which qemu is free to do how it wishes.
> I definitely agree that qemu
> would be wise as a quality-of-implementation to allow the larger size,
> for maximum interoperability, even if it exceeds advertised limits (that
> is, when no limits are advertised, we should handle everything possible
> if it is not so large as to be construed a denial-of-service, and
> NBD_CMD_TRIM is not large; and when limits ARE advertised, a client that
> violates limits is out of spec but we can still be liberal and respond
> successfully to such a client rather than having to outright reject it).
> So I think this patch is headed in the right direction.

I'd agree with that.

What surprises me is that a release kernel is using experimental NBD
extensions; there's no guarantee these won't change. Or does fstrim
work some other way?

Alex

>>
>> and looking at the kernel side implementation of the nbd device
>> (drivers/block/nbd.c) where it only sends the request header with no data
>> for a NBD_CMD_TRIM.
>>
>> With this fix in, I am now able to run fstrim on my qcow2 images and keep
>> them small (or at least keep their size proportional to the amount of data
>> present on them).
>>
>> Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
>> CC: Paolo Bonzini <pbonzini@redhat.com>
>> CC: <qemu-devel@nongnu.org>
>> CC: <qemu-stable@nongnu.org>
>> CC: <qemu-trivial@nongnu.org>
>
> This is NOT trivial material and should not go in through that tree.
> However, I concur that it qualifies for a backport on a stable branch.
>
>> ---
>>  nbd.c | 8 +++++++-
>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/nbd.c b/nbd.c
>> index b3d9654..e733669 100644
>> --- a/nbd.c
>> +++ b/nbd.c
>> @@ -1209,6 +1209,11 @@ static ssize_t nbd_co_send_reply(NBDRequest *req, struct nbd_reply *reply,
>>      return rc;
>>  }
>>
>> +static bool nbd_should_check_request_size(const struct nbd_request *request)
>> +{
>> +    return (request->type & NBD_CMD_MASK_COMMAND) != NBD_CMD_TRIM;
>> +}
>> +
>>  static ssize_t nbd_co_receive_request(NBDRequest *req, struct nbd_request *request)
>>  {
>>      NBDClient *client = req->client;
>> @@ -1227,7 +1232,8 @@ static ssize_t nbd_co_receive_request(NBDRequest *req, struct nbd_request *reque
>>          goto out;
>>      }
>>
>> -    if (request->len > NBD_MAX_BUFFER_SIZE) {
>> +    if (nbd_should_check_request_size(request) &&
>> +        request->len > NBD_MAX_BUFFER_SIZE) {
>
> I'd rather sort out the implications of this on the NBD protocol before
> taking anything into qemu. We've got time on our hand, so let's use it
> to get this right. (That, and I have several pending patches that
> conflict with this as part of adding WRITE_ZEROES and INFO_BLOCK_SIZE
> support, where it may be easier to resubmit this fix on top of my
> pending patches).
>
>>          LOG("len (%u) is larger than max len (%u)",
>>              request->len, NBD_MAX_BUFFER_SIZE);
>>          rc = -EINVAL;
>
> --
> Eric Blake   eblake redhat com   +1-919-301-3266
> Libvirt virtualization library http://libvirt.org

--
Alex Bligh
On 05/10/2016 09:08 AM, Alex Bligh wrote:
> Eric,
>
>> Hmm. The current wording of the experimental block size additions does
>> NOT allow the client to send a NBD_CMD_TRIM with a size larger than the
>> maximum NBD_CMD_WRITE:
>> https://github.com/yoe/nbd/blob/extension-info/doc/proto.md#block-size-constraints
>
> Correct
>
>> Maybe we should revisit that in the spec, and/or advertise yet another
>> block size (since the maximum size for a trim and/or write_zeroes
>> request may indeed be different than the maximum size for a read/write).
>
> I think it's up to the server to either handle large requests, or
> for the client to break these up.

But the question at hand here is whether we should permit servers to
advertise multiple maximum block sizes (one for read/write, another one
for trim/write_zero, or even two [at least qemu tracks a separate
maximum trim vs. write_zero sizing in its generic block layer]), or
merely stick with the current wording that requires clients that honor
maximum block size to obey the same maximum for ALL commands, regardless
of the amount of data sent over the wire.

> The core problem here is that the kernel (and, ahem, most servers) are
> ignorant of the block size extension, and need to guess how to break
> things up. In my view the client (kernel in this case) should
> be breaking the trim requests up into whatever size it uses as the
> maximum size write requests. But then it would have to know about block
> sizes which are in (another) experimental extension.

Correct - no one has yet patched the kernel to honor block sizes
advertised through what is currently an experimental extension.
(We have ioctl(NBD_SET_BLKSIZE), which can be argued to set the kernel's
minimum block size, but I haven't audited whether the kernel actually
guarantees that all client requests are sent aligned to the value passed
that way - but we have nothing to set the maximum size, and are at the
mercy of however the kernel currently decides to split large requests).

So the kernel is currently one of the clients that does NOT honor block
sizes, and as such, servers should be prepared for ANY size up to
UINT_MAX (other than DoS handling). My question above only applies to
clients that use the experimental block size extensions.

> I've nothing against you fixing it in qemu's server (after all I don't
> think there is anything to /prohibit/ a server working on something
> larger than the maximum block size), and ...
>
>> But since the kernel is the one sending the large length request, and
>> since you are right that this is not a denial-of-service in the amount
>> of data being sent in a single NBD message,
>
> ... currently there isn't actually a maximum block size advertised (that
> being in another experimental addition), so this is just DoS prevention,
> which qemu is free to do how it wishes.

Okay, sounds like that part is uncontroversial, and it is indeed worth
improving qemu's behavior.

> What surprises me is that a release kernel is using experimental
> NBD extensions; there's no guarantee these won't change. Or does
> fstrim work some other way?

No extension in play. The kernel is obeying NBD_FLAG_SEND_TRIM, which
is in the normative standard, and unrelated to the INFO/BLOCK_SIZE
extensions.
Eric,

On 10 May 2016, at 16:29, Eric Blake <eblake@redhat.com> wrote:
>>> Maybe we should revisit that in the spec, and/or advertise yet another
>>> block size (since the maximum size for a trim and/or write_zeroes
>>> request may indeed be different than the maximum size for a read/write).
>>
>> I think it's up to the server to either handle large requests, or
>> for the client to break these up.
>
> But the question at hand here is whether we should permit servers to
> advertise multiple maximum block sizes (one for read/write, another one
> for trim/write_zero, or even two [at least qemu tracks a separate
> maximum trim vs. write_zero sizing in its generic block layer]), or
> merely stick with the current wording that requires clients that honor
> maximum block size to obey the same maximum for ALL commands, regardless
> of amount of data sent over the wire.

In my view, we should not change this. Block size maxima are not there
to support DoS prevention (that's a separate phrase). They are there to
impose maximum block sizes. Adding a different maximum block size for
different commands is way too overengineered. There are, after all,
reasons (especially without structured replies) why you'd want different
maximum block sizes for writes and reads. If clients support block
sizes, they will necessarily have to have the infrastructure to break
requests up.

IE maximum block size should continue to mean maximum block size.

>> The core problem here is that the kernel (and, ahem, most servers) are
>> ignorant of the block size extension, and need to guess how to break
>> things up. In my view the client (kernel in this case) should
>> be breaking the trim requests up into whatever size it uses as the
>> maximum size write requests. But then it would have to know about block
>> sizes which are in (another) experimental extension.
>
> Correct - no one has yet patched the kernel to honor block sizes
> advertised through what is currently an experimental extension.
Unsurprising as it's still experimental, and only settled down a couple
of weeks ago :-)

> (We have ioctl(NBD_SET_BLKSIZE) which can be argued to set the kernel's
> minimum block size,

Technically that is 'out of band transmission of block size
constraints' :-)

> but I haven't audited whether the kernel actually
> guarantees that all client requests are sent aligned to the value passed
> that way - but we have nothing to set the maximum size,

indeed

> and are at the
> mercy of however the kernel currently decides to split large requests).

I am surprised TRIM doesn't get broken up the same way READ and WRITE
do.

> So the kernel is currently one of the clients that does NOT honor block
> sizes, and as such, servers should be prepared for ANY size up to
> UINT_MAX (other than DoS handling).

Or not to permit a connection.

> My question above only applies to
> clients that use the experimental block size extensions.

Indeed.

>> What surprises me is that a release kernel is using experimental
>> NBD extensions; there's no guarantee these won't change. Or does
>> fstrim work some other way?
>
> No extension in play. The kernel is obeying NBD_FLAG_SEND_TRIM, which
> is in the normative standard, and unrelated to the INFO/BLOCK_SIZE
> extensions.

My mistake. I was confusing 'WRITE_ZEROES' with 'TRIM'.

--
Alex Bligh
On 10 May 2016, at 16:29, Eric Blake <eblake@redhat.com> wrote:
> So the kernel is currently one of the clients that does NOT honor block
> sizes, and as such, servers should be prepared for ANY size up to
> UINT_MAX (other than DoS handling).

Interesting followup question:

If the kernel does not fragment TRIM requests at all (in the same way it
fragments read and write requests), I suspect something bad may happen
with TRIM requests over 2^31 in size (particularly over 2^32 in size),
as the length field in nbd only has 32 bits.

Whether it supports block size constraints or not, it is going to need
to do *some* breaking up of requests.

--
Alex Bligh
On Tue, May 10, 2016 at 04:38:29PM +0100, Alex Bligh wrote:
> Eric,
>
> On 10 May 2016, at 16:29, Eric Blake <eblake@redhat.com> wrote:
> >>> Maybe we should revisit that in the spec, and/or advertise yet another
> >>> block size (since the maximum size for a trim and/or write_zeroes
> >>> request may indeed be different than the maximum size for a read/write).
> >>
> >> I think it's up to the server to either handle large requests, or
> >> for the client to break these up.
> >
> > But the question at hand here is whether we should permit servers to
> > advertise multiple maximum block sizes (one for read/write, another one
> > for trim/write_zero, or even two [at least qemu tracks a separate
> > maximum trim vs. write_zero sizing in its generic block layer]), or
> > merely stick with the current wording that requires clients that honor
> > maximum block size to obey the same maximum for ALL commands, regardless
> > of amount of data sent over the wire.
>
> In my view, we should not change this. Block size maxima are not there
> to support DoS prevention (that's a separate phrase). They are there
> to impose maximum block sizes. Adding a different maximum block size
> for different commands is way too overengineered. There are after
> all reasons (especially without structured replies) why you'd want
> different maximum block sizes for writes and reads. If clients support
> block sizes, they will necessarily have to have the infrastructure
> to break requests up.
>
> IE maximum block size should continue to mean maximum block size.
>
> >> The core problem here is that the kernel (and, ahem, most servers) are
> >> ignorant of the block size extension, and need to guess how to break
> >> things up. In my view the client (kernel in this case) should
> >> be breaking the trim requests up into whatever size it uses as the
> >> maximum size write requests. But then it would have to know about block
> >> sizes which are in (another) experimental extension.
> >
> > Correct - no one has yet patched the kernel to honor block sizes
> > advertised through what is currently an experimental extension.
>
> Unsurprising as it's still experimental, and only settled down a couple
> of weeks ago :-)
>
> > (We have ioctl(NBD_SET_BLKSIZE) which can be argued to set the kernel's
> > minimum block size,
>
> Technically that is 'out of band transmission of block size
> constraints' :-)
>
> > but I haven't audited whether the kernel actually
> > guarantees that all client requests are sent aligned to the value passed
> > that way - but we have nothing to set the maximum size,
>
> indeed
>
> > and are at the
> > mercy of however the kernel currently decides to split large requests).
>
> I am surprised TRIM doesn't get broken up the same way READ and WRITE
> do.

I'm by no means an expert in this, but why would the kernel break up
those TRIM commands? After all, breaking things up makes sense when the
length of the request is big, not that much when it only contains the
request header, which is the case for TRIM commands. What am I missing?

Quentin
On 05/10/2016 09:41 AM, Alex Bligh wrote:
>
> On 10 May 2016, at 16:29, Eric Blake <eblake@redhat.com> wrote:
>
>> So the kernel is currently one of the clients that does NOT honor block
>> sizes, and as such, servers should be prepared for ANY size up to
>> UINT_MAX (other than DoS handling).
>
> Interesting followup question:
>
> If the kernel does not fragment TRIM requests at all (in the
> same way it fragments read and write requests), I suspect
> something bad may happen with TRIM requests over 2^31
> in size (particularly over 2^32 in size), as the length
> field in nbd only has 32 bits.
>
> Whether it supports block size constraints or not, it is
> going to need to do *some* breaking up of requests.

Does anyone have an easy way to cause the kernel to request a trim
operation that large on a > 4G export? I'm not familiar enough with
EXT4 operation to know what file system operations you can run to
ultimately indirectly create a file system trim operation that large.
But maybe there is something simpler - does the kernel let you use the
fallocate(2) syscall operation with FALLOC_FL_PUNCH_HOLE or
FALLOC_FL_ZERO_RANGE on an fd backed by an NBD device?
On 10 May 2016, at 16:45, Quentin Casasnovas <quentin.casasnovas@oracle.com> wrote:
> I'm by no mean an expert in this, but why would the kernel break up those
> TRIM commands? After all, breaking things up makes sense when the length
> of the request is big, not that much when it only contains the request
> header, which is the case for TRIM commands.

1. You are assuming that the only reason for limiting the size of
   operations is limiting the data transferred within one request. That
   is not necessarily the case. There are good reasons (if only
   orthogonality) to have limits in place even where no data is
   transferred.

2. As and when the blocksize extension is implemented in the kernel (it
   isn't now), the protocol requires it.

3. The maximum length of an NBD trim operation is 2^32. The maximum
   length of a trim operation is larger. Therefore the kernel needs to
   do at least some breaking up.
On 10 May 2016, at 16:46, Eric Blake <eblake@redhat.com> wrote:
> Does anyone have an easy way to cause the kernel to request a trim
> operation that large on a > 4G export? I'm not familiar enough with
> EXT4 operation to know what file system operations you can run to
> ultimately indirectly create a file system trim operation that large.
> But maybe there is something simpler - does the kernel let you use the
> fallocate(2) syscall operation with FALLOC_FL_PUNCH_HOLE or
> FALLOC_FL_ZERO_RANGE on an fd backed by an NBD device?

Not tried it, but fallocate(1) with -p ?

http://man7.org/linux/man-pages/man1/fallocate.1.html

As it takes length and offset in TB, PB, EB, ZB and YB, it seems to be
64 bit aware :-)

--
Alex Bligh
On Tue, May 10, 2016 at 09:46:36AM -0600, Eric Blake wrote:
> On 05/10/2016 09:41 AM, Alex Bligh wrote:
> >
> > On 10 May 2016, at 16:29, Eric Blake <eblake@redhat.com> wrote:
> >
> >> So the kernel is currently one of the clients that does NOT honor block
> >> sizes, and as such, servers should be prepared for ANY size up to
> >> UINT_MAX (other than DoS handling).
> >
> > Interesting followup question:
> >
> > If the kernel does not fragment TRIM requests at all (in the
> > same way it fragments read and write requests), I suspect
> > something bad may happen with TRIM requests over 2^31
> > in size (particularly over 2^32 in size), as the length
> > field in nbd only has 32 bits.
> >
> > Whether it supports block size constraints or not, it is
> > going to need to do *some* breaking up of requests.
>
> Does anyone have an easy way to cause the kernel to request a trim
> operation that large on a > 4G export? I'm not familiar enough with
> EXT4 operation to know what file system operations you can run to
> ultimately indirectly create a file system trim operation that large.
> But maybe there is something simpler - does the kernel let you use the
> fallocate(2) syscall operation with FALLOC_FL_PUNCH_HOLE or
> FALLOC_FL_ZERO_RANGE on an fd backed by an NBD device?

It was fairly reproducible here, we just used a random qcow2 image with
some Debian minimal system pre-installed, mounted that qcow2 image
through qemu-nbd, then compiled a whole kernel inside it. Then you can
make clean and run fstrim on the mount point. I'm assuming you can go
faster than that by just writing a big file to the qcow2 image mounted
without -o discard, deleting the big file, then remounting with -o
discard + running fstrim.

Quentin
On Tue, May 10, 2016 at 04:49:57PM +0100, Alex Bligh wrote:
>
> On 10 May 2016, at 16:45, Quentin Casasnovas <quentin.casasnovas@oracle.com> wrote:
>
> > I'm by no mean an expert in this, but why would the kernel break up those
> > TRIM commands? After all, breaking things up makes sense when the length
> > of the request is big, not that much when it only contains the request
> > header, which is the case for TRIM commands.
>
> 1. You are assuming that the only reason for limiting the size of operations is
> limiting the data transferred within one request. That is not necessarily
> the case. There are good reasons (if only orthogonality) to have limits
> in place even where no data is transferred.
>
> 2. As and when the blocksize extension is implemented in the kernel (it isn't
> now), the protocol requires it.
>
> 3. The maximum length of an NBD trim operation is 2^32. The maximum length
> of a trim operation is larger. Therefore the kernel needs to do at least
> some breaking up.

Alright, I assumed by 'length of an NBD request' the specification was
talking about the length of.. well, the request, as opposed to whatever
is in the length field of the request header.

Is there a use case where you'd want to split up a single big TRIM
request into smaller ones (as in some hardware would not support it or
something)? Even then, it looks like this splitting up would be
hardware dependent and better implemented in block device drivers. I'm
just finding it odd that something that fits inside the length field
can't be used.

I do agree with your point number 3, obviously if the length field type
doesn't allow something bigger than a u32, then the kernel has to do
some breaking up in that case.

Quentin
On 10 May 2016, at 17:04, Quentin Casasnovas <quentin.casasnovas@oracle.com> wrote:
> Alright, I assumed by 'length of an NBD request', the specification was
> talking about the length of.. well, the request as opposed to whatever is
> in the length field of the request header.

With NBD_CMD_TRIM the length in the header field is 32 bit and specifies
the length of data to trim, not the length of the data transferred
(which is none).

> Is there a use case where you'd want to split up a single big TRIM request
> in smaller ones (as in some hardware would not support it or something)?
> Even then, it looks like this splitting up would be hardware dependent and
> better implemented in block device drivers.

Part of the point of the block size extension is to push such limits to
the client.

I could make up use cases (e.g. that performing a multi-gigabyte trim in
a single threaded server will effectively block all other I/O), but the
main reason in my book is orthogonality, and the fact the client needs
to do some breaking up anyway.

> I'm just finding odd that something that fits inside the length field can't
> be used.

That's a different point. That's Qemu's 'Denial of Service Attack'
prevention, *not* maximum block sizes. It isn't dropping it because of a
maximum block size parameter. If it doesn't support the block size
extension, which the version you're looking at does not, it's meant to
handle requests up to 2^32-1 long EXCEPT that it MAY error requests so
long as to cause a denial of service attack. As this doesn't fit into
that case (it's a TRIM), it shouldn't be erroring it on those grounds.

I agree Qemu should fix that.

(So in a sense Eric and I are arguing about something irrelevant to your
current problem, which is how this would work /with/ the block size
extensions, as Eric is in the process of implementing them).

> I do agree with your point number 3, obviously if the length field type
> doesn't allow something bigger than a u32, then the kernel has to do
> some breaking up in that case.
On Tue, May 10, 2016 at 05:23:21PM +0100, Alex Bligh wrote:
>
> On 10 May 2016, at 17:04, Quentin Casasnovas <quentin.casasnovas@oracle.com> wrote:
>
> > Alright, I assumed by 'length of an NBD request', the specification was
> > talking about the length of.. well, the request as opposed to whatever is
> > in the length field of the request header.
>
> With NBD_CMD_TRIM the length in the header field is 32 bit and specifies
> the length of data to trim, not the length of the data transferred (which
> is none).
>
> > Is there a use case where you'd want to split up a single big TRIM request
> > in smaller ones (as in some hardware would not support it or something)?
> > Even then, it looks like this splitting up would be hardware dependent and
> > better implemented in block device drivers.
>
> Part of the point of the block size extension is to push such limits to the
> client.
>
> I could make up use cases (e.g. that performing a multi-gigabyte trim in
> a single threaded server will effectively block all other I/O), but the
> main reason in my book is orthogonality, and the fact the client needs
> to do some breaking up anyway.
>
> > I'm just finding odd that something that fits inside the length field can't
> > be used.
>
> That's a different point. That's Qemu's 'Denial of Service Attack'
> prevention, *not* maximum block sizes. It isn't dropping it because
> of a maximum block size parameter. If it doesn't support the block size
> extension, which the version you're looking at does not, it's meant
> to handle requests up to 2^32-1 long EXCEPT that it MAY error requests
> so long as to cause a denial of service attack. As this doesn't fit
> into that case (it's a TRIM), it shouldn't be erroring it on those grounds.
>
> I agree Qemu should fix that.
>
> (So in a sense Eric and I are arguing about something irrelevant to
> your current problem, which is how this would work /with/ the block
> size extensions, as Eric is in the process of implementing them).

Riight! OK understood, thanks for the explanation.

Quentin
On Tue, May 10, 2016 at 05:54:44PM +0200, Quentin Casasnovas wrote:
> On Tue, May 10, 2016 at 09:46:36AM -0600, Eric Blake wrote:
> > On 05/10/2016 09:41 AM, Alex Bligh wrote:
> > >
> > > Interesting followup question:
> > >
> > > If the kernel does not fragment TRIM requests at all (in the
> > > same way it fragments read and write requests), I suspect
> > > something bad may happen with TRIM requests over 2^31
> > > in size (particularly over 2^32 in size), as the length
> > > field in nbd only has 32 bits.
> > >
> > > Whether it supports block size constraints or not, it is
> > > going to need to do *some* breaking up of requests.
> >
> > Does anyone have an easy way to cause the kernel to request a trim
> > operation that large on a > 4G export? I'm not familiar enough with
> > EXT4 operation to know what file system operations you can run to
> > ultimately indirectly create a file system trim operation that large.
> > But maybe there is something simpler - does the kernel let you use the
> > fallocate(2) syscall operation with FALLOC_FL_PUNCH_HOLE or
> > FALLOC_FL_ZERO_RANGE on an fd backed by an NBD device?
>
> It was fairly reproducible here, we just used a random qcow2 image with
> some Debian minimal system pre-installed, mounted that qcow2 image through
> qemu-nbd then compiled a whole kernel inside it. Then you can make clean
> and run fstrim on the mount point. I'm assuming you can go faster than
> that by just writing a big file to the qcow2 image mounted without -o
> discard, delete the big file, then remount with -o discard + run fstrim.

Looks like there's an easier way:

  $ qemu-img create -f qcow2 foo.qcow2 10G
  $ qemu-nbd --discard=on -c /dev/nbd0 foo.qcow2
  $ mkfs.ext4 /dev/nbd0
  mke2fs 1.42.13 (17-May-2015)
  Discarding device blocks: failed - Input/output error
  Creating filesystem with 2621440 4k blocks and 655360 inodes
  Filesystem UUID: 25aeb51f-0dea-4c1d-8b65-61f6bcdf97e9
  Superblock backups stored on blocks:
          32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

  Allocating group tables: done
  Writing inode tables: done
  Creating journal (32768 blocks): done
  Writing superblocks and filesystem accounting information: done

Notice the "Discarding device blocks: failed - Input/output error" line;
I bet that is mkfs.ext4 trying to trim all blocks prior to writing the
filesystem, but it gets an I/O error while doing so. I haven't verified
it is the same problem, but if it isn't, simply mount the resulting
filesystem and run fstrim on it:

  $ mount -o discard /dev/nbd0 /tmp/foo
  $ fstrim /tmp/foo
  fstrim: /tmp/foo: FITRIM ioctl failed: Input/output error

Quentin
On 10/05/2016 17:38, Alex Bligh wrote:
>> and are at the
>> mercy of however the kernel currently decides to split large requests).
>
> I am surprised TRIM doesn't get broken up the same way READ and WRITE
> do.

The payload of TRIM has constant size, so it makes sense to do that.  The
kernel then self-imposes a 2GB limit in blkdev_issue_discard.  On the
other hand, READ and WRITE of size n can possibly require O(n) memory.

Paolo
On Tue, May 10, 2016 at 04:41:27PM +0100, Alex Bligh wrote:
>
> On 10 May 2016, at 16:29, Eric Blake <eblake@redhat.com> wrote:
>
>> So the kernel is currently one of the clients that does NOT honor block
>> sizes, and as such, servers should be prepared for ANY size up to
>> UINT_MAX (other than DoS handling).
>
> Interesting followup question:
>
> If the kernel does not fragment TRIM requests at all (in the
> same way it fragments read and write requests), I suspect
> something bad may happen with TRIM requests over 2^31
> in size (particularly over 2^32 in size), as the length
> field in nbd only has 32 bits.
>
> Whether it supports block size constraints or not, it is
> going to need to do *some* breaking up of requests.

Doesn't the kernel break TRIM requests at max_discard_sectors?  I remember
testing the following change in the nbd kmod a long time ago:

#if 0
	disk->queue->limits.max_discard_sectors = UINT_MAX;
#else
	disk->queue->limits.max_discard_sectors = 65536;
#endif

The problem with large TRIM requests is exactly the same as with other
(READ/WRITE) large requests -- they _may_ take loads of time, and if the
client wants to support a fast switchover to another replica it must put
some constraints on the timeout value... which may be very easily broken
if you allow things like a 1GB trim request on a server using DIO on the
other side (and, say, a heavily fragmented sparse file over XFS, never
mind).

I don't think it's the problem of the QEMU NBD server to fix that; the
constraint on the server side is perfectly fine.  The point to note is
that a change on the client side is required to limit the maximum size of
TRIM requests.

The reason for the patch I pasted above was that at the time I looked into
it there was no other way to change the TRIM request size via
/sys/block/$dev/queue/max_discard_sectors, but maybe more recent kernels
allow that?  Not to mention that an NBD client option to set this at NBD
queue setup time would be nice...

Just my 2p.
On 05/10/2016 10:33 AM, Quentin Casasnovas wrote:
> Looks like there's an easier way:
>
> $ qemu-img create -f qcow2 foo.qcow2 10G
> $ qemu-nbd --discard=on -c /dev/nbd0 foo.qcow2
> $ mkfs.ext4 /dev/nbd0
> mke2fs 1.42.13 (17-May-2015)
> Discarding device blocks: failed - Input/output error

strace on mkfs.ext4 shows:

...
open("/dev/nbd1", O_RDWR|O_EXCL)        = 3
stat("/dev/nbd1", {st_mode=S_IFBLK|0660, st_rdev=makedev(43, 1), ...}) = 0
ioctl(3, BLKDISCARDZEROES, [0])         = 0
ioctl(3, BLKROGET, [0])                 = 0
uname({sysname="Linux", nodename="red", ...}) = 0
ioctl(3, BLKDISCARD, [0, 1000000])      = 0
write(1, "Discarding device blocks: ", 26) = 26
write(1, " 4096/2621440", 15 4096/2621440) = 15
write(1, "\10\10\10\10\10\10\10\10\10\10\10\1) = 15
ioctl(3, BLKDISCARD, [1000000, 80000000]) = -1 EIO (Input/output error)
write(1, " ", 15 ) = 15
write(1, "\10\10\10\10\10\10\10\10\10\10\10\1) = 15
write(1, "failed - ", 9failed - ) = 9

so it does indeed look like the smaller request [0, 0x1000000] succeeded
while the larger [0x1000000, 0x80000000] failed with EIO.  But I
instrumented my qemu-nbd server to output all of the incoming requests it
was receiving from the kernel, and I never saw a mention of NBD_CMD_TRIM
being received.  Maybe it's dependent on what version of the kernel and/or
NBD kernel module you are running?

I also tried fallocate(1) on /dev/nbd1, as in:

# strace fallocate -o 0 -l 64k -p /dev/nbd1

but it looks like the kernel doesn't yet wire up fallocate(2) to forward
to the appropriate device ioctls and/or NBD_CMD_TRIM, where strace says:

open("/dev/nbd1", O_RDWR)               = 3
fallocate(3, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 0, 65536) = -1 ENODEV (No such device)

The ENODEV is a rather unusual choice of error.
On 05/06/2016 02:45 AM, Quentin Casasnovas wrote:
> When running fstrim on a filesystem mounted through qemu-nbd with
> --discard=on, fstrim would fail with I/O errors:
>
> $ fstrim /k/spl/ice/
> fstrim: /k/spl/ice/: FITRIM ioctl failed: Input/output error
>
> and qemu-nbd was spitting these:
>
> nbd.c:nbd_co_receive_request():L1232: len (94621696) is larger than max len (33554432)

Your patch duplicates what is already present in qemu:

commit eb38c3b67018ff8069e4f674a28661931a8a3e4f
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Thu Jan 7 14:32:42 2016 +0100

    nbd-server: do not check request length except for reads and writes

    Only reads and writes need to allocate memory correspondent to the
    request length.  Other requests can be sent to the storage without
    allocating any memory, and thus any request length is acceptable.

    Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
    Cc: qemu-block@nongnu.org
    Reviewed-by: Max Reitz <mreitz@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

For the purposes of qemu-stable, it's better to backport the existing
patch than to write a new version of it.

It also helps to state what version of qemu you were testing, as it is
obviously not the (soon-to-be-released) version 2.6, which already has
the fix.
> nbd.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/nbd.c b/nbd.c
> index b3d9654..e733669 100644
> --- a/nbd.c
> +++ b/nbd.c
> @@ -1209,6 +1209,11 @@ static ssize_t nbd_co_send_reply(NBDRequest *req, struct nbd_reply *reply,
>      return rc;
>  }
>
> +static bool nbd_should_check_request_size(const struct nbd_request *request)
> +{
> +    return (request->type & NBD_CMD_MASK_COMMAND) != NBD_CMD_TRIM;
> +}
> +
>  static ssize_t nbd_co_receive_request(NBDRequest *req, struct nbd_request *request)
>  {
>      NBDClient *client = req->client;
> @@ -1227,7 +1232,8 @@ static ssize_t nbd_co_receive_request(NBDRequest *req, struct nbd_request *reque
>          goto out;
>      }
>
> -    if (request->len > NBD_MAX_BUFFER_SIZE) {
> +    if (nbd_should_check_request_size(request) &&
> +        request->len > NBD_MAX_BUFFER_SIZE) {
>          LOG("len (%u) is larger than max len (%u)",
>              request->len, NBD_MAX_BUFFER_SIZE);
>          rc = -EINVAL;
>
On Tue, May 10, 2016 at 02:34:18PM -0600, Eric Blake wrote:
> On 05/06/2016 02:45 AM, Quentin Casasnovas wrote:
> > When running fstrim on a filesystem mounted through qemu-nbd with
> > --discard=on, fstrim would fail with I/O errors:
> >
> > $ fstrim /k/spl/ice/
> > fstrim: /k/spl/ice/: FITRIM ioctl failed: Input/output error
> >
> > and qemu-nbd was spitting these:
> >
> > nbd.c:nbd_co_receive_request():L1232: len (94621696) is larger than max len (33554432)
>
> Your patch duplicates what is already present in qemu:
>
> commit eb38c3b67018ff8069e4f674a28661931a8a3e4f
> Author: Paolo Bonzini <pbonzini@redhat.com>
> Date:   Thu Jan 7 14:32:42 2016 +0100
>
>     nbd-server: do not check request length except for reads and writes
>
>     Only reads and writes need to allocate memory correspondent to the
>     request length.  Other requests can be sent to the storage without
>     allocating any memory, and thus any request length is acceptable.
>
>     Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
>     Cc: qemu-block@nongnu.org
>     Reviewed-by: Max Reitz <mreitz@redhat.com>
>     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>
> For the purposes of qemu-stable, it's better to backport the existing
> patch than to write a new version of it.

Ha, sorry, I missed this!  I wouldn't have tried to debug/fix it myself
otherwise :)

> It also helps to state what version of qemu you were testing, as it is
> obviously not the (soon-to-be-released) version 2.6 which already has
> the fix.

I was using qemu-2.5.0-rc3 on Gentoo, but this was also verified on some
Debian systems which appear to be on 1.1.2+dfsg-6a+deb7u12, and on Ubuntu
Xenial 2.0.0+dfsg-2ubuntu1.22.

I wrote the patch on top of https://github.com/bonzini/qemu.git:master
(a7e00e2), which didn't contain the fix last Friday.

Anyway, cool if the fix is going into mainline :)

Quentin
On 10/05/2016 18:23, Alex Bligh wrote:
>> Is there a use case where you'd want to split up a single big TRIM request
>> into smaller ones (as in some hardware would not support it or something)?
>> Even then, it looks like this splitting up would be hardware dependent and
>> better implemented in block device drivers.
>
> Part of the point of the block size extension is to push such limits to the
> client.
>
> I could make up use cases (e.g. that performing a multi-gigabyte trim in
> a single-threaded server will effectively block all other I/O), but the
> main reason in my book is orthogonality, and the fact the client needs
> to do some breaking up anyway.

That's why SCSI, for example, has a limit on UNMAP and WRITE SAME requests
(actually it has three limits: the number of ranges per unmap, which in
NBD and in SCSI WRITE SAME is 1; the number of blocks per unmap
descriptor; and the number of blocks per WRITE SAME operation).  These
limits, however, are different from the read/write limit, because you want
to support systems where TRIM is much faster than regular I/O (and
possibly even O(1) if trimming something that is already trimmed).

Paolo
On 05/11/2016 03:38 AM, Paolo Bonzini wrote:
>
> On 10/05/2016 18:23, Alex Bligh wrote:
>>> Is there a use case where you'd want to split up a single big TRIM request
>>> in smaller ones (as in some hardware would not support it or something)?
>>> Even then, it looks like this splitting up would be hardware dependant and
>>> better implemented in block device drivers.
>>
>> Part of the point of the block size extension is to push such limits to the
>> client.
>>
>> I could make up use cases (e.g. that performing a multi-gigabyte trim in
>> a single threaded server will effectively block all other I/O), but the
>> main reason in my book is orthogonality, and the fact the client needs
>> to do some breaking up anyway.
>
> That's why SCSI for example has a limit to UNMAP and WRITE SAME requests
> (actually it has three limits: number of ranges per unmap, which in NBD
> and in SCSI WRITE SAME is 1; number of blocks per unmap descriptor;
> number of blocks per WRITE SAME operation).  These limits however are a
> different one than the read/write limit, because you want to support
> systems where TRIM is much faster than regular I/O (and possibly even
> O(1) if trimming something that is already trimmed).

Then I think I will propose a doc patch to the extension-info branch that
adds new INFO items for NBD_INFO_TRIM_SIZE and NBD_INFO_WRITE_ZERO_SIZE
(if requested by the client and replied by the server, then this can be
larger than the normal maximum size in NBD_INFO_BLOCK_SIZE; if either
side doesn't request the info, then assume any maximum in
NBD_INFO_BLOCK_SIZE applies, otherwise UINT32_MAX; and the two infos are
separate items because NBD_FLAG_SEND_TRIM and NBD_FLAG_SEND_WRITE_ZEROES
are orthogonally optional).
On 05/11/2016 02:34 AM, Quentin Casasnovas wrote:
>> Your patch duplicates what is already present in qemu:
>>
>> commit eb38c3b67018ff8069e4f674a28661931a8a3e4f
>
> Ha sorry I missed this!  I wouldn't have tried to debug/fix myself otherwise :)

It's okay - you at least sparked a good conversation about what the NBD
protocol itself needs to think about in relation to these issues.

>> It also helps to state what version of qemu you were testing, as it is
>> obviously not the (soon-to-be-released) version 2.6 which already has
>> the fix.
>
> I was using qemu-2.5.0-rc3 on Gentoo but this was also verified on some
> Debian systems which appears to be on 1.1.2+dfsg-6a+deb7u12 and on Ubuntu
> Xenial 2.0.0+dfsg-2ubuntu1.22.
>
> I wrote the patch on top of https://github.com/bonzini/qemu.git:master
> (a7e00e2) which didn't contain the fix last Friday.

The master branch in Paolo's tree doesn't always actively track upstream
(mainly when he is about to post a pull request that needs to be rebased).
Best is to base patches against true upstream:
git://git.qemu-project.org/qemu.git

> Anyway, cool if the fix is going into mainline :)

Already in there for the 2.6 release candidates (and final 2.6 should be
later today).  And this thread argues that the stable 2.5.x branch should
indeed backport the identified patch.
On 11 May 2016, at 15:08, Eric Blake <eblake@redhat.com> wrote:
> Then I think I will propose a doc patch to the extension-info branch
> that adds new INFO items for NBD_INFO_TRIM_SIZE and
> NBD_INFO_WRITE_ZERO_SIZE (if requested by client and replied by server,
> then this can be larger than the normal maximum size in
> NBD_INFO_BLOCK_SIZE; if either side doesn't request the info, then
> assume any maximum in NBD_INFO_BLOCK_SIZE applies, otherwise UINT32_MAX;
> and the two infos are separate items because NBD_FLAG_SEND_TRIM and
> NBD_FLAG_SEND_WRITE_ZEROES are orthogonally optional).

mmmm... at this rate you'd be better off with a list of command / maximum
size tuples!

-- 
Alex Bligh
On 11/05/2016 16:55, Alex Bligh wrote:
>> Then I think I will propose a doc patch to the extension-info branch
>> that adds new INFO items for NBD_INFO_TRIM_SIZE and
>> NBD_INFO_WRITE_ZERO_SIZE (if requested by client and replied by server,
>> then this can be larger than the normal maximum size in
>> NBD_INFO_BLOCK_SIZE; if either side doesn't request the info, then
>> assume any maximum in NBD_INFO_BLOCK_SIZE applies, otherwise UINT32_MAX;
>> and the two infos are separate items because NBD_FLAG_SEND_TRIM and
>> NBD_FLAG_SEND_WRITE_ZEROES are orthogonally optional).
>
> mmmm ...
>
> at this rate you'd be better with a list of command / maximum size
> tuples!

Not sure if serious or not, but anyway I think enumerating the items
individually should be enough.  It's not like commands grow on trees. :)

Paolo
On Tue, May 10, 2016 at 04:08:50PM +0100, Alex Bligh wrote:
> What surprises me is that a release kernel is using experimental
> NBD extensions; there's no guarantee these won't change.  Or does
> fstrim work some other way?

What makes you say NBD_CMD_TRIM is experimental?  It's been implemented
for years.
On Tue, May 10, 2016 at 09:29:23AM -0600, Eric Blake wrote:
> On 05/10/2016 09:08 AM, Alex Bligh wrote:
> > Eric,
> >
> >> Hmm. The current wording of the experimental block size additions does
> >> NOT allow the client to send a NBD_CMD_TRIM with a size larger than the
> >> maximum NBD_CMD_WRITE:
> >> https://github.com/yoe/nbd/blob/extension-info/doc/proto.md#block-size-constraints
> >
> > Correct
> >
> >> Maybe we should revisit that in the spec, and/or advertise yet another
> >> block size (since the maximum size for a trim and/or write_zeroes
> >> request may indeed be different than the maximum size for a read/write).
> >
> > I think it's up to the server to either handle large requests, or
> > for the client to break these up.
>
> But the question at hand here is whether we should permit servers to
> advertise multiple maximum block sizes (one for read/write, another one
> for trim/write_zero, or even two [at least qemu tracks a separate
> maximum trim vs. write_zero sizing in its generic block layer]), or
> merely stick with the current wording that requires clients that honor
> maximum block size to obey the same maximum for ALL commands, regardless
> of the amount of data sent over the wire.
>
> > The core problem here is that the kernel (and, ahem, most servers) are
> > ignorant of the block size extension, and need to guess how to break
> > things up.  In my view the client (kernel in this case) should
> > be breaking the trim requests up into whatever size it uses as the
> > maximum size write requests.  But then it would have to know about block
> > sizes which are in (another) experimental extension.
>
> Correct - no one has yet patched the kernel to honor block sizes
> advertised through what is currently an experimental extension.
> (We
> have ioctl(NBD_SET_BLKSIZE) which can be argued to set the kernel's
> minimum block size, but I haven't audited whether the kernel actually
> guarantees that all client requests are sent aligned to the value passed
> that way - but we have nothing to set the maximum size, and are at the
> mercy of however the kernel currently decides to split large requests).

I don't actually think it does that at all, tbh.  There is an
"integrityhuge" test in the reference server test suite which performs a
number of large requests (up to 50M), and which was created by a script
that just does direct read requests to /dev/nbdX.  It just so happens
that most upper layers (filesystems etc.) don't make requests larger
than about 32MiB, but that's not related.

> So the kernel is currently one of the clients that does NOT honor block
> sizes, and as such, servers should be prepared for ANY size up to
> UINT_MAX (other than DoS handling).  My question above only applies to
> clients that use the experimental block size extensions.

Right.

[...]
On Tue, May 10, 2016 at 04:38:29PM +0100, Alex Bligh wrote:
> On 10 May 2016, at 16:29, Eric Blake <eblake@redhat.com> wrote:
> > So the kernel is currently one of the clients that does NOT honor block
> > sizes, and as such, servers should be prepared for ANY size up to
> > UINT_MAX (other than DoS handling).
>
> Or not to permit a connection.

Right -- and this is why I was recommending against making this a MUST in
the first place.

[...]
On 11 May 2016, at 22:06, Wouter Verhelst <w@uter.be> wrote:
> On Tue, May 10, 2016 at 04:08:50PM +0100, Alex Bligh wrote:
>> What surprises me is that a release kernel is using experimental
>> NBD extensions; there's no guarantee these won't change.  Or does
>> fstrim work some other way?
>
> What makes you say NBD_CMD_TRIM is experimental?  It's been implemented
> for years.

Me confusing it with NBD_CMD_WRITE_ZEROES, which I corrected later in
the thread.
On 11 May 2016, at 22:12, Wouter Verhelst <w@uter.be> wrote:
> On Tue, May 10, 2016 at 04:38:29PM +0100, Alex Bligh wrote:
>> On 10 May 2016, at 16:29, Eric Blake <eblake@redhat.com> wrote:
>>> So the kernel is currently one of the clients that does NOT honor block
>>> sizes, and as such, servers should be prepared for ANY size up to
>>> UINT_MAX (other than DoS handling).
>>
>> Or not to permit a connection.
>
> Right -- and this is why I was recommending against making this a MUST
> in the first place.

Indeed, and it currently is a 'MAY':

> except that if a server believes a client's behaviour constitutes
> a denial of service, it MAY initiate a hard disconnect.
diff --git a/nbd.c b/nbd.c
index b3d9654..e733669 100644
--- a/nbd.c
+++ b/nbd.c
@@ -1209,6 +1209,11 @@ static ssize_t nbd_co_send_reply(NBDRequest *req, struct nbd_reply *reply,
     return rc;
 }
 
+static bool nbd_should_check_request_size(const struct nbd_request *request)
+{
+    return (request->type & NBD_CMD_MASK_COMMAND) != NBD_CMD_TRIM;
+}
+
 static ssize_t nbd_co_receive_request(NBDRequest *req, struct nbd_request *request)
 {
     NBDClient *client = req->client;
@@ -1227,7 +1232,8 @@ static ssize_t nbd_co_receive_request(NBDRequest *req, struct nbd_request *reque
         goto out;
     }
 
-    if (request->len > NBD_MAX_BUFFER_SIZE) {
+    if (nbd_should_check_request_size(request) &&
+        request->len > NBD_MAX_BUFFER_SIZE) {
         LOG("len (%u) is larger than max len (%u)",
             request->len, NBD_MAX_BUFFER_SIZE);
         rc = -EINVAL;
When running fstrim on a filesystem mounted through qemu-nbd with
--discard=on, fstrim would fail with I/O errors:

$ fstrim /k/spl/ice/
fstrim: /k/spl/ice/: FITRIM ioctl failed: Input/output error

and qemu-nbd was spitting these:

nbd.c:nbd_co_receive_request():L1232: len (94621696) is larger than max len (33554432)

Enabling debug output on the NBD driver in the Linux kernel showed that
the request length field sent was the one received, and that qemu-nbd
returned 22 (EINVAL) as the error code:

EXT4-fs (nbd0p1): mounted filesystem with ordered data mode. Opts: discard
block nbd0: request ffff880094c0cc18: dequeued (flags=1)
block nbd0: request ffff880094c0cc18: sending control (read@5255168,4096B)
block nbd0: request ffff880094c0cc18: got reply
block nbd0: request ffff880094c0cc18: got 4096 bytes data
block nbd0: request ffff880094c0cc18: done
block nbd0: request ffff8801728796d8: dequeued (flags=1)
block nbd0: request ffff8801728796d8: sending control (trim/discard@39464960,45056B)
block nbd0: request ffff8801728796d8: got reply
block nbd0: request ffff8801728796d8: done
block nbd0: request ffff880172879ae0: dequeued (flags=1)
block nbd0: request ffff880172879ae0: sending control (trim/discard@39653376,16384B)
block nbd0: request ffff880172879ae0: got reply
block nbd0: request ffff880172879ae0: done
block nbd0: request ffff880172879d90: dequeued (flags=1)
block nbd0: request ffff880172879d90: sending control (trim/discard@40644608,94621696B)
                                                                         ^^^^^^^^
block nbd0: Other side returned error (22)
                                      ^^

The length of the request seems huge, but this is really just the
filesystem telling the block device driver that "this length should be
trimmed", and, unlike for a NBD_CMD_READ or NBD_CMD_WRITE, we'll not try
to read/write that amount of data from/to the NBD socket.  It is thus
safe to remove the length check for a NBD_CMD_TRIM.
I've confirmed this with both the protocol documentation at:

https://github.com/yoe/nbd/blob/master/doc/proto.md

and looking at the kernel-side implementation of the nbd device
(drivers/block/nbd.c), where it only sends the request header with no
data for a NBD_CMD_TRIM.

With this fix in, I am now able to run fstrim on my qcow2 images and keep
them small (or at least keep their size proportional to the amount of
data present on them).

Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: <qemu-devel@nongnu.org>
CC: <qemu-stable@nongnu.org>
CC: <qemu-trivial@nongnu.org>
---
 nbd.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)