diff mbox

nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

Message ID 1462524302-15558-1-git-send-email-quentin.casasnovas@oracle.com (mailing list archive)
State New, archived
Headers show

Commit Message

Quentin Casasnovas May 6, 2016, 8:45 a.m. UTC
When running fstrim on a filesystem mounted through qemu-nbd with
--discard=on, fstrim would fail with I/O errors:

  $ fstrim /k/spl/ice/
  fstrim: /k/spl/ice/: FITRIM ioctl failed: Input/output error

and qemu-nbd was spitting these:

  nbd.c:nbd_co_receive_request():L1232: len (94621696) is larger than max len (33554432)

Enabling debug output on the NBD driver in the Linux kernel showed the
request length field sent was the one received and that qemu-nbd returned
22 (EINVAL) as error code:

  EXT4-fs (nbd0p1): mounted filesystem with ordered data mode. Opts: discard
  block nbd0: request ffff880094c0cc18: dequeued (flags=1)
  block nbd0: request ffff880094c0cc18: sending control (read@5255168,4096B)
  block nbd0: request ffff880094c0cc18: got reply
  block nbd0: request ffff880094c0cc18: got 4096 bytes data
  block nbd0: request ffff880094c0cc18: done
  block nbd0: request ffff8801728796d8: dequeued (flags=1)
  block nbd0: request ffff8801728796d8: sending control (trim/discard@39464960,45056B)
  block nbd0: request ffff8801728796d8: got reply
  block nbd0: request ffff8801728796d8: done
  block nbd0: request ffff880172879ae0: dequeued (flags=1)
  block nbd0: request ffff880172879ae0: sending control (trim/discard@39653376,16384B)
  block nbd0: request ffff880172879ae0: got reply
  block nbd0: request ffff880172879ae0: done
  block nbd0: request ffff880172879d90: dequeued (flags=1)
  block nbd0: request ffff880172879d90: sending control (trim/discard@40644608,94621696B)
                                                                               ^^^^^^^^
  block nbd0: Other side returned error (22)
                                         ^^

The length of the request seems huge but this is really just the filesystem
telling the block device driver that "this length should be trimmed", and,
unlike for a NBD_CMD_READ or NBD_CMD_WRITE, we'll not try to read/write
that amount of data from/to the NBD socket.  It is thus safe to remove the
length check for a NBD_CMD_TRIM.

I've confirmed this with both the protocol documentation at:

 https://github.com/yoe/nbd/blob/master/doc/proto.md

and looking at the kernel side implementation of the nbd device
(drivers/block/nbd.c) where it only sends the request header with no data
for a NBD_CMD_TRIM.

With this fix in, I am now able to run fstrim on my qcow2 images and keep
them small (or at least keep their size proportional to the amount of data
present on them).

Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: <qemu-devel@nongnu.org>
CC: <qemu-stable@nongnu.org>
CC: <qemu-trivial@nongnu.org>
---
 nbd.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Eric Blake May 10, 2016, 2:01 p.m. UTC | #1
[adding nbd-devel, qemu-block]

On 05/06/2016 02:45 AM, Quentin Casasnovas wrote:
> When running fstrim on a filesystem mounted through qemu-nbd with
> --discard=on, fstrim would fail with I/O errors:
> 
>   $ fstrim /k/spl/ice/
>   fstrim: /k/spl/ice/: FITRIM ioctl failed: Input/output error
> 
> and qemu-nbd was spitting these:
> 
>   nbd.c:nbd_co_receive_request():L1232: len (94621696) is larger than max len (33554432)
> 

> 
> The length of the request seems huge but this is really just the filesystem
> telling the block device driver that "this length should be trimmed", and,
> unlike for a NBD_CMD_READ or NBD_CMD_WRITE, we'll not try to read/write
> that amount of data from/to the NBD socket.  It is thus safe to remove the
> length check for a NBD_CMD_TRIM.
> 
> I've confirmed this with both the protocol documentation at:
> 
>  https://github.com/yoe/nbd/blob/master/doc/proto.md

Hmm. The current wording of the experimental block size additions does
NOT allow the client to send a NBD_CMD_TRIM with a size larger than the
maximum NBD_CMD_WRITE:
https://github.com/yoe/nbd/blob/extension-info/doc/proto.md#block-size-constraints

Maybe we should revisit that in the spec, and/or advertise yet another
block size (since the maximum size for a trim and/or write_zeroes
request may indeed be different than the maximum size for a read/write).

But since the kernel is the one sending the large length request, and
since you are right that this is not a denial-of-service in the amount
of data being sent in a single NBD message, I definitely agree that qemu
would be wise as a quality-of-implementation to allow the larger size,
for maximum interoperability, even if it exceeds advertised limits (that
is, when no limits are advertised, we should handle everything possible
if it is not so large as to be construed a denial-of-service, and
NBD_CMD_TRIM is not large; and when limits ARE advertised, a client that
violates limits is out of spec but we can still be liberal and respond
successfully to such a client rather than having to outright reject it).
 So I think this patch is headed in the right direction.

> 
> and looking at the kernel side implementation of the nbd device
> (drivers/block/nbd.c) where it only sends the request header with no data
> for a NBD_CMD_TRIM.
> 
> With this fix in, I am now able to run fstrim on my qcow2 images and keep
> them small (or at least keep their size proportional to the amount of data
> present on them).
> 
> Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: <qemu-devel@nongnu.org>
> CC: <qemu-stable@nongnu.org>
> CC: <qemu-trivial@nongnu.org>

This is NOT trivial material and should not go in through that tree.
However, I concur that it qualifies for a backport on a stable branch.

> ---
>  nbd.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/nbd.c b/nbd.c
> index b3d9654..e733669 100644
> --- a/nbd.c
> +++ b/nbd.c
> @@ -1209,6 +1209,11 @@ static ssize_t nbd_co_send_reply(NBDRequest *req, struct nbd_reply *reply,
>      return rc;
>  }
>  
> +static bool nbd_should_check_request_size(const struct nbd_request *request)
> +{
> +        return (request->type & NBD_CMD_MASK_COMMAND) != NBD_CMD_TRIM;
> +}
> +
>  static ssize_t nbd_co_receive_request(NBDRequest *req, struct nbd_request *request)
>  {
>      NBDClient *client = req->client;
> @@ -1227,7 +1232,8 @@ static ssize_t nbd_co_receive_request(NBDRequest *req, struct nbd_request *reque
>          goto out;
>      }
>  
> -    if (request->len > NBD_MAX_BUFFER_SIZE) {
> +    if (nbd_should_check_request_size(request) &&
> +        request->len > NBD_MAX_BUFFER_SIZE) {

I'd rather sort out the implications of this on the NBD protocol before
taking anything into qemu.  We've got time on our hand, so let's use it
to get this right.  (That, and I have several pending patches that
conflict with this as part of adding WRITE_ZEROES and INFO_BLOCK_SIZE
support, where it may be easier to resubmit this fix on top of my
pending patches).

>          LOG("len (%u) is larger than max len (%u)",
>              request->len, NBD_MAX_BUFFER_SIZE);
>          rc = -EINVAL;
>
Alex Bligh May 10, 2016, 3:08 p.m. UTC | #2
Eric,

> Hmm. The current wording of the experimental block size additions does
> NOT allow the client to send a NBD_CMD_TRIM with a size larger than the
> maximum NBD_CMD_WRITE:
> https://github.com/yoe/nbd/blob/extension-info/doc/proto.md#block-size-constraints

Correct

> Maybe we should revisit that in the spec, and/or advertise yet another
> block size (since the maximum size for a trim and/or write_zeroes
> request may indeed be different than the maximum size for a read/write).

I think it's up to the server to either handle large requests, or
for the client to break these up.

The core problem here is that the kernel (and, ahem, most servers) are
ignorant of the block size extension, and need to guess how to break
things up. In my view the client (kernel in this case) should
be breaking the trim requests up into whatever size it uses as the
maximum size write requests. But then it would have to know about block
sizes which are in (another) experimental extension.

I've nothing against you fixing it qemu's server (after all I don't
think there is anything to /prohibit/ a server working on something
larger than the maximum block size), and ...

> But since the kernel is the one sending the large length request, and
> since you are right that this is not a denial-of-service in the amount
> of data being sent in a single NBD message,

... currently there isn't actually a maximum block size advertised (that
being in another experimental addition), so this is just DoS prevention,
which qemu is free to do how it wishes.

> I definitely agree that qemu
> would be wise as a quality-of-implementation to allow the larger size,
> for maximum interoperability, even if it exceeds advertised limits (that
> is, when no limits are advertised, we should handle everything possible
> if it is not so large as to be construed a denial-of-service, and
> NBD_CMD_TRIM is not large; and when limits ARE advertised, a client that
> violates limits is out of spec but we can still be liberal and respond
> successfully to such a client rather than having to outright reject it).
> So I think this patch is headed in the right direction.

I'd agree with that.

What surprises me is that a release kernel is using experimental
NBD extensions; there's no guarantee these won't change. Or does
fstrim work some other way?

Alex

> 
>> 
>> and looking at the kernel side implementation of the nbd device
>> (drivers/block/nbd.c) where it only sends the request header with no data
>> for a NBD_CMD_TRIM.
>> 
>> With this fix in, I am now able to run fstrim on my qcow2 images and keep
>> them small (or at least keep their size proportional to the amount of data
>> present on them).
>> 
>> Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
>> CC: Paolo Bonzini <pbonzini@redhat.com>
>> CC: <qemu-devel@nongnu.org>
>> CC: <qemu-stable@nongnu.org>
>> CC: <qemu-trivial@nongnu.org>
> 
> This is NOT trivial material and should not go in through that tree.
> However, I concur that it qualifies for a backport on a stable branch.
> 
>> ---
>> nbd.c | 8 +++++++-
>> 1 file changed, 7 insertions(+), 1 deletion(-)
>> 
>> diff --git a/nbd.c b/nbd.c
>> index b3d9654..e733669 100644
>> --- a/nbd.c
>> +++ b/nbd.c
>> @@ -1209,6 +1209,11 @@ static ssize_t nbd_co_send_reply(NBDRequest *req, struct nbd_reply *reply,
>>     return rc;
>> }
>> 
>> +static bool nbd_should_check_request_size(const struct nbd_request *request)
>> +{
>> +        return (request->type & NBD_CMD_MASK_COMMAND) != NBD_CMD_TRIM;
>> +}
>> +
>> static ssize_t nbd_co_receive_request(NBDRequest *req, struct nbd_request *request)
>> {
>>     NBDClient *client = req->client;
>> @@ -1227,7 +1232,8 @@ static ssize_t nbd_co_receive_request(NBDRequest *req, struct nbd_request *reque
>>         goto out;
>>     }
>> 
>> -    if (request->len > NBD_MAX_BUFFER_SIZE) {
>> +    if (nbd_should_check_request_size(request) &&
>> +        request->len > NBD_MAX_BUFFER_SIZE) {
> 
> I'd rather sort out the implications of this on the NBD protocol before
> taking anything into qemu.  We've got time on our hand, so let's use it
> to get this right.  (That, and I have several pending patches that
> conflict with this as part of adding WRITE_ZEROES and INFO_BLOCK_SIZE
> support, where it may be easier to resubmit this fix on top of my
> pending patches).
> 
>>         LOG("len (%u) is larger than max len (%u)",
>>             request->len, NBD_MAX_BUFFER_SIZE);
>>         rc = -EINVAL;
>> 
> 
> --
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 
> ------------------------------------------------------------------------------
> Mobile security can be enabling, not merely restricting. Employees who
> bring their own devices (BYOD) to work are irked by the imposition of MDM
> restrictions. Mobile Device Manager Plus allows you to control only the
> apps on BYO-devices by containerizing them, leaving personal data untouched!
> https://ad.doubleclick.net/ddm/clk/304595813;131938128;j_______________________________________________
> Nbd-general mailing list
> Nbd-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nbd-general

--
Alex Bligh
Eric Blake May 10, 2016, 3:29 p.m. UTC | #3
On 05/10/2016 09:08 AM, Alex Bligh wrote:
> Eric,
> 
>> Hmm. The current wording of the experimental block size additions does
>> NOT allow the client to send a NBD_CMD_TRIM with a size larger than the
>> maximum NBD_CMD_WRITE:
>> https://github.com/yoe/nbd/blob/extension-info/doc/proto.md#block-size-constraints
> 
> Correct
> 
>> Maybe we should revisit that in the spec, and/or advertise yet another
>> block size (since the maximum size for a trim and/or write_zeroes
>> request may indeed be different than the maximum size for a read/write).
> 
> I think it's up to the server to either handle large requests, or
> for the client to break these up.

But the question at hand here is whether we should permit servers to
advertise multiple maximum block sizes (one for read/write, another one
for trim/write_zero, or even two [at least qemu tracks a separate
maximum trim vs. write_zero sizing in its generic block layer]), or
merely stick with the current wording that requires clients that honor
maximum block size to obey the same maximum for ALL commands, regardless
of amount of data sent over the wire.

> 
> The core problem here is that the kernel (and, ahem, most servers) are
> ignorant of the block size extension, and need to guess how to break
> things up. In my view the client (kernel in this case) should
> be breaking the trim requests up into whatever size it uses as the
> maximum size write requests. But then it would have to know about block
> sizes which are in (another) experimental extension.

Correct - no one has yet patched the kernel to honor block sizes
advertised through what is currently an experimental extension.  (We
have ioctl(NBD_SET_BLKSIZE) which can be argued to set the kernel's
minimum block size, but I haven't audited whether the kernel actually
guarantees that all client requests are sent aligned to the value passed
that way - but we have nothing to set the maximum size, and are at the
mercy of however the kernel currently decides to split large requests).
 So the kernel is currently one of the clients that does NOT honor block
sizes, and as such, servers should be prepared for ANY size up to
UINT_MAX (other than DoS handling).  My question above only applies to
clients that use the experimental block size extensions.

> 
> I've nothing against you fixing it qemu's server (after all I don't
> think there is anything to /prohibit/ a server working on something
> larger than the maximum block size), and ...
> 
>> But since the kernel is the one sending the large length request, and
>> since you are right that this is not a denial-of-service in the amount
>> of data being sent in a single NBD message,
> 
> ... currently there isn't actually a maximum block size advertised (that
> being in another experimental addition), so this is just DoS prevention,
> which qemu is free to do how it wishes.

Okay, sounds like that part is uncontroversial, and it is indeed worth
improving qemu's behavior.

> 
> What surprises me is that a release kernel is using experimental
> NBD extensions; there's no guarantee these won't change. Or does
> fstrim work some other way?

No extension in play.  The kernel is obeying NBD_FLAG_SEND_TRIM, which
is in the normative standard, and unrelated to the INFO/BLOCK_SIZE
extensions.
Alex Bligh May 10, 2016, 3:38 p.m. UTC | #4
Eric,

On 10 May 2016, at 16:29, Eric Blake <eblake@redhat.com> wrote:
>>> Maybe we should revisit that in the spec, and/or advertise yet another
>>> block size (since the maximum size for a trim and/or write_zeroes
>>> request may indeed be different than the maximum size for a read/write).
>> 
>> I think it's up to the server to either handle large requests, or
>> for the client to break these up.
> 
> But the question at hand here is whether we should permit servers to
> advertise multiple maximum block sizes (one for read/write, another one
> for trim/write_zero, or even two [at least qemu tracks a separate
> maximum trim vs. write_zero sizing in its generic block layer]), or
> merely stick with the current wording that requires clients that honor
> maximum block size to obey the same maximum for ALL commands, regardless
> of amount of data sent over the wire.

In my view, we should not change this. Block sizes maxima are not there
to support DoS prevention (that's a separate phrase). They are there
to impose maximum block sizes. Adding a different maximum block size
for different commands is way too overengineered. There are after
all reasons (especially without structured replies) why you'd want
different maximum block sizes for writes and reads. If clients support
block sizes, they will necessarily have to have the infrastructure
to break requests up.

IE maximum block size should continue to mean maximum block size.

>> 
>> The core problem here is that the kernel (and, ahem, most servers) are
>> ignorant of the block size extension, and need to guess how to break
>> things up. In my view the client (kernel in this case) should
>> be breaking the trim requests up into whatever size it uses as the
>> maximum size write requests. But then it would have to know about block
>> sizes which are in (another) experimental extension.
> 
> Correct - no one has yet patched the kernel to honor block sizes
> advertised through what is currently an experimental extension.

Unsurprising at it's still experimental, and only settled down a couple
of weeks ago :-)

>  (We
> have ioctl(NBD_SET_BLKSIZE) which can be argued to set the kernel's
> minimum block size,

Technically that is 'out of band transmission of block size
constraints' :-)

> but I haven't audited whether the kernel actually
> guarantees that all client requests are sent aligned to the value passed
> that way - but we have nothing to set the maximum size,

indeed

> and are at the
> mercy of however the kernel currently decides to split large requests).

I am surprised TRIM doesn't get broken up the same way READ and WRITE
do.

> So the kernel is currently one of the clients that does NOT honor block
> sizes, and as such, servers should be prepared for ANY size up to
> UINT_MAX (other than DoS handling).

Or not to permit a connection.

>  My question above only applies to
> clients that use the experimental block size extensions.

Indeed.

>> What surprises me is that a release kernel is using experimental
>> NBD extensions; there's no guarantee these won't change. Or does
>> fstrim work some other way?
> 
> No extension in play.  The kernel is obeying NBD_FLAG_SEND_TRIM, which
> is in the normative standard, and unrelated to the INFO/BLOCK_SIZE
> extensions.

My mistake. I was confusing 'WRITE_ZEROES' with 'TRIM'.

--
Alex Bligh
Alex Bligh May 10, 2016, 3:41 p.m. UTC | #5
On 10 May 2016, at 16:29, Eric Blake <eblake@redhat.com> wrote:

> So the kernel is currently one of the clients that does NOT honor block
> sizes, and as such, servers should be prepared for ANY size up to
> UINT_MAX (other than DoS handling).

Interesting followup question:

If the kernel does not fragment TRIM requests at all (in the
same way it fragments read and write requests), I suspect
something bad may happen with TRIM requests over 2^31
in size (particularly over 2^32 in size), as the length
field in nbd only has 32 bits.

Whether it supports block size constraints or not, it is
going to need to do *some* breaking up of requests.

--
Alex Bligh
Quentin Casasnovas May 10, 2016, 3:45 p.m. UTC | #6
On Tue, May 10, 2016 at 04:38:29PM +0100, Alex Bligh wrote:
> Eric,
> 
> On 10 May 2016, at 16:29, Eric Blake <eblake@redhat.com> wrote:
> >>> Maybe we should revisit that in the spec, and/or advertise yet another
> >>> block size (since the maximum size for a trim and/or write_zeroes
> >>> request may indeed be different than the maximum size for a read/write).
> >> 
> >> I think it's up to the server to either handle large requests, or
> >> for the client to break these up.
> > 
> > But the question at hand here is whether we should permit servers to
> > advertise multiple maximum block sizes (one for read/write, another one
> > for trim/write_zero, or even two [at least qemu tracks a separate
> > maximum trim vs. write_zero sizing in its generic block layer]), or
> > merely stick with the current wording that requires clients that honor
> > maximum block size to obey the same maximum for ALL commands, regardless
> > of amount of data sent over the wire.
> 
> In my view, we should not change this. Block sizes maxima are not there
> to support DoS prevention (that's a separate phrase). They are there
> to impose maximum block sizes. Adding a different maximum block size
> for different commands is way too overengineered. There are after
> all reasons (especially without structured replies) why you'd want
> different maximum block sizes for writes and reads. If clients support
> block sizes, they will necessarily have to have the infrastructure
> to break requests up.
> 
> IE maximum block size should continue to mean maximum block size.
> 
> >> 
> >> The core problem here is that the kernel (and, ahem, most servers) are
> >> ignorant of the block size extension, and need to guess how to break
> >> things up. In my view the client (kernel in this case) should
> >> be breaking the trim requests up into whatever size it uses as the
> >> maximum size write requests. But then it would have to know about block
> >> sizes which are in (another) experimental extension.
> > 
> > Correct - no one has yet patched the kernel to honor block sizes
> > advertised through what is currently an experimental extension.
> 
> Unsurprising at it's still experimental, and only settled down a couple
> of weeks ago :-)
> 
> >  (We
> > have ioctl(NBD_SET_BLKSIZE) which can be argued to set the kernel's
> > minimum block size,
> 
> Technically that is 'out of band transmission of block size
> constraints' :-)
> 
> > but I haven't audited whether the kernel actually
> > guarantees that all client requests are sent aligned to the value passed
> > that way - but we have nothing to set the maximum size,
> 
> indeed
> 
> > and are at the
> > mercy of however the kernel currently decides to split large requests).
> 
> I am surprised TRIM doesn't get broken up the same way READ and WRITE
> do.
>

I'm by no mean an expert in this, but why would the kernel break up those
TRIM commands?  After all, breaking things up makes sense when the length
of the request is big, not that much when it only contains the request
header, which is the case for TRIM commands.

What am I missing?

Quentin
Eric Blake May 10, 2016, 3:46 p.m. UTC | #7
On 05/10/2016 09:41 AM, Alex Bligh wrote:
> 
> On 10 May 2016, at 16:29, Eric Blake <eblake@redhat.com> wrote:
> 
>> So the kernel is currently one of the clients that does NOT honor block
>> sizes, and as such, servers should be prepared for ANY size up to
>> UINT_MAX (other than DoS handling).
> 
> Interesting followup question:
> 
> If the kernel does not fragment TRIM requests at all (in the
> same way it fragments read and write requests), I suspect
> something bad may happen with TRIM requests over 2^31
> in size (particularly over 2^32 in size), as the length
> field in nbd only has 32 bits.
> 
> Whether it supports block size constraints or not, it is
> going to need to do *some* breaking up of requests.

Does anyone have an easy way to cause the kernel to request a trim
operation that large on a > 4G export?  I'm not familiar enough with
EXT4 operation to know what file system operations you can run to
ultimately indirectly create a file system trim operation that large.
But maybe there is something simpler - does the kernel let you use the
fallocate(2) syscall operation with FALLOC_FL_PUNCH_HOLE or
FALLOC_FL_ZERO_RANGE on an fd backed by an NBD device?
Alex Bligh May 10, 2016, 3:49 p.m. UTC | #8
On 10 May 2016, at 16:45, Quentin Casasnovas <quentin.casasnovas@oracle.com> wrote:

> I'm by no mean an expert in this, but why would the kernel break up those
> TRIM commands?  After all, breaking things up makes sense when the length
> of the request is big, not that much when it only contains the request
> header, which is the case for TRIM commands.

1. You are assuming that the only reason for limiting the size of operations is
   limiting the data transferred within one request. That is not necessarily
   the case. There are good reasons (if only orthogonality) to have limits
   in place even where no data is transferred.

2. As and when the blocksize extension is implemented in the kernel (it isn't
   now), the protocol requires it.

3. The maximum length of an NBD trim operation is 2^32. The maximum length
   of a trim operation is larger. Therefore the kernel needs to do at least
   some breaking up.
Alex Bligh May 10, 2016, 3:52 p.m. UTC | #9
On 10 May 2016, at 16:46, Eric Blake <eblake@redhat.com> wrote:

> Does anyone have an easy way to cause the kernel to request a trim
> operation that large on a > 4G export?  I'm not familiar enough with
> EXT4 operation to know what file system operations you can run to
> ultimately indirectly create a file system trim operation that large.
> But maybe there is something simpler - does the kernel let you use the
> fallocate(2) syscall operation with FALLOC_FL_PUNCH_HOLE or
> FALLOC_FL_ZERO_RANGE on an fd backed by an NBD device?

Not tried it, but fallocate(1) with -p ?

http://man7.org/linux/man-pages/man1/fallocate.1.html

As it takes length and offset in TB, PB, EB, ZB and YB, it
seems to be 64 bit aware :-)

--
Alex Bligh
Quentin Casasnovas May 10, 2016, 3:54 p.m. UTC | #10
On Tue, May 10, 2016 at 09:46:36AM -0600, Eric Blake wrote:
> On 05/10/2016 09:41 AM, Alex Bligh wrote:
> > 
> > On 10 May 2016, at 16:29, Eric Blake <eblake@redhat.com> wrote:
> > 
> >> So the kernel is currently one of the clients that does NOT honor block
> >> sizes, and as such, servers should be prepared for ANY size up to
> >> UINT_MAX (other than DoS handling).
> > 
> > Interesting followup question:
> > 
> > If the kernel does not fragment TRIM requests at all (in the
> > same way it fragments read and write requests), I suspect
> > something bad may happen with TRIM requests over 2^31
> > in size (particularly over 2^32 in size), as the length
> > field in nbd only has 32 bits.
> > 
> > Whether it supports block size constraints or not, it is
> > going to need to do *some* breaking up of requests.
> 
> Does anyone have an easy way to cause the kernel to request a trim
> operation that large on a > 4G export?  I'm not familiar enough with
> EXT4 operation to know what file system operations you can run to
> ultimately indirectly create a file system trim operation that large.
> But maybe there is something simpler - does the kernel let you use the
> fallocate(2) syscall operation with FALLOC_FL_PUNCH_HOLE or
> FALLOC_FL_ZERO_RANGE on an fd backed by an NBD device?
> 

It was fairly reproducible here, we just used a random qcow2 image with
some Debian minimal system pre-installed, mounted that qcow2 image through
qemu-nbd then compiled a whole kernel inside it.  Then you can make clean
and run fstrim on the mount point.  I'm assuming you can go faster than
that by just writing a big file to the qcow2 image mounted without -o
discard, delete the big file, then remount with -o discard + run fstrim.

Quentin
Quentin Casasnovas May 10, 2016, 4:04 p.m. UTC | #11
On Tue, May 10, 2016 at 04:49:57PM +0100, Alex Bligh wrote:
> 
> On 10 May 2016, at 16:45, Quentin Casasnovas <quentin.casasnovas@oracle.com> wrote:
> 
> > I'm by no mean an expert in this, but why would the kernel break up those
> > TRIM commands?  After all, breaking things up makes sense when the length
> > of the request is big, not that much when it only contains the request
> > header, which is the case for TRIM commands.
> 
> 1. You are assuming that the only reason for limiting the size of operations is
>    limiting the data transferred within one request. That is not necessarily
>    the case. There are good reasons (if only orthogonality) to have limits
>    in place even where no data is transferred.
> 
> 2. As and when the blocksize extension is implemented in the kernel (it isn't
>    now), the protocol requires it.
> 
> 3. The maximum length of an NBD trim operation is 2^32. The maximum length
>    of a trim operation is larger. Therefore the kernel needs to do at least
>    some breaking up.

Alright, I assumed by 'length of an NBD request', the specification was
talking about the length of.. well, the request as opposed to whatever is
in the length field of the request header.

Is there a use case where you'd want to split up a single big TRIM request
in smaller ones (as in some hardware would not support it or something)?
Even then, it looks like this splitting up would be hardware dependant and
better implemented in block device drivers.

I'm just finding odd that something that fits inside the length field can't
be used.  I do agree with your point number 3, obviously if the lenght
field type doesn't allow something bigger than a u32, then the kernel has
to do some breaking up in that case.

Quentin
Alex Bligh May 10, 2016, 4:23 p.m. UTC | #12
On 10 May 2016, at 17:04, Quentin Casasnovas <quentin.casasnovas@oracle.com> wrote:

> Alright, I assumed by 'length of an NBD request', the specification was
> talking about the length of.. well, the request as opposed to whatever is
> in the length field of the request header.

With NBD_CMD_TRIM the length in the header field is 32 bit and specifies
the length of data to trim, not the length of the data transferred (which
is none).

> Is there a use case where you'd want to split up a single big TRIM request
> in smaller ones (as in some hardware would not support it or something)?
> Even then, it looks like this splitting up would be hardware dependant and
> better implemented in block device drivers.

Part of the point of the block size extension is to push such limits to the
client.

I could make up use cases (e.g. that performing a multi-gigabyte trim in
a single threaded server will effectively block all other I/O), but the
main reason in my book is orthogonality, and the fact the client needs
to do some breaking up anyway.

> I'm just finding odd that something that fits inside the length field can't
> be used.

That's a different point. That's Qemu's 'Denial of Service Attack'
prevention, *not* maximum block sizes. It isn't dropping it because
of a maximum block size parameter. If it doesn't support the block size
extension which the version you're looking at does not, it's meant
to handle requests up to 2^32-1 long EXCEPT that it MAY error requests
so long as to cause a denial of service attack. As this doesn't fit
into that case (it's a TRIM), it shouldn't be erroring it on that grounds.

I agree Qemu should fix that.

(So in a sense Eric and I are arguing about something irrelevant to
your current problem, which is how this would work /with/ the block
size extensions, as Eric is in the process of implementing them).

>  I do agree with your point number 3, obviously if the lenght
> field type doesn't allow something bigger than a u32, then the kernel has
> to do some breaking up in that case.
Quentin Casasnovas May 10, 2016, 4:27 p.m. UTC | #13
On Tue, May 10, 2016 at 05:23:21PM +0100, Alex Bligh wrote:
> 
> On 10 May 2016, at 17:04, Quentin Casasnovas <quentin.casasnovas@oracle.com> wrote:
> 
> > Alright, I assumed by 'length of an NBD request', the specification was
> > talking about the length of.. well, the request as opposed to whatever is
> > in the length field of the request header.
> 
> With NBD_CMD_TRIM the length in the header field is 32 bit and specifies
> the length of data to trim, not the length of the data transferred (which
> is none).
> 
> > Is there a use case where you'd want to split up a single big TRIM request
> > in smaller ones (as in some hardware would not support it or something)?
> > Even then, it looks like this splitting up would be hardware dependant and
> > better implemented in block device drivers.
> 
> Part of the point of the block size extension is to push such limits to the
> client.
> 
> I could make up use cases (e.g. that performing a multi-gigabyte trim in
> a single threaded server will effectively block all other I/O), but the
> main reason in my book is orthogonality, and the fact the client needs
> to do some breaking up anyway.
> 
> > I'm just finding odd that something that fits inside the length field can't
> > be used.
> 
> That's a different point. That's Qemu's 'Denial of Service Attack'
> prevention, *not* maximum block sizes. It isn't dropping it because
> of a maximum block size parameter. If it doesn't support the block size
> extension which the version you're looking at does not, it's meant
> to handle requests up to 2^32-1 long EXCEPT that it MAY error requests
> so long as to cause a denial of service attack. As this doesn't fit
> into that case (it's a TRIM), it shouldn't be erroring it on that grounds.
> 
> I agree Qemu should fix that.
> 
> (So in a sense Eric and I are arguing about something irrelevant to
> your current problem, which is how this would work /with/ the block
> size extensions, as Eric is in the process of implementing them).
> 

Riight!  OK understood, thanks for the explanation.

Quentin
Quentin Casasnovas May 10, 2016, 4:33 p.m. UTC | #14
On Tue, May 10, 2016 at 05:54:44PM +0200, Quentin Casasnovas wrote:
> On Tue, May 10, 2016 at 09:46:36AM -0600, Eric Blake wrote:
> > On 05/10/2016 09:41 AM, Alex Bligh wrote:
> > > 
> > > On 10 May 2016, at 16:29, Eric Blake <eblake@redhat.com> wrote:
> > > 
> > >> So the kernel is currently one of the clients that does NOT honor block
> > >> sizes, and as such, servers should be prepared for ANY size up to
> > >> UINT_MAX (other than DoS handling).
> > > 
> > > Interesting followup question:
> > > 
> > > If the kernel does not fragment TRIM requests at all (in the
> > > same way it fragments read and write requests), I suspect
> > > something bad may happen with TRIM requests over 2^31
> > > in size (particularly over 2^32 in size), as the length
> > > field in nbd only has 32 bits.
> > > 
> > > Whether it supports block size constraints or not, it is
> > > going to need to do *some* breaking up of requests.
> > 
> > Does anyone have an easy way to cause the kernel to request a trim
> > operation that large on a > 4G export?  I'm not familiar enough with
> > EXT4 operation to know what file system operations you can run to
> > ultimately indirectly create a file system trim operation that large.
> > But maybe there is something simpler - does the kernel let you use the
> > fallocate(2) syscall operation with FALLOC_FL_PUNCH_HOLE or
> > FALLOC_FL_ZERO_RANGE on an fd backed by an NBD device?
> > 
> 
> It was fairly reproducible here, we just used a random qcow2 image with
> some Debian minimal system pre-installed, mounted that qcow2 image through
> qemu-nbd then compiled a whole kernel inside it.  Then you can make clean
> and run fstrim on the mount point.  I'm assuming you can go faster than
> that by just writing a big file to the qcow2 image mounted without -o
> discard, delete the big file, then remount with -o discard + run fstrim.
> 

Looks like there's an easier way:

 $ qemu-img create -f qcow2 foo.qcow2 10G
 $ qemu-nbd --discard=on -c /dev/nbd0 foo.qcow2
 $ mkfs.ext4 /dev/nbd0
 mke2fs 1.42.13 (17-May-2015)
 Discarding device blocks: failed - Input/output error
 Creating filesystem with 2621440 4k blocks and 655360 inodes
 Filesystem UUID: 25aeb51f-0dea-4c1d-8b65-61f6bcdf97e9
 Superblock backups stored on blocks:
	   32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

 Allocating group tables: done
 Writing inode tables: done
 Creating journal (32768 blocks): done
 Writing superblocks and filesystem accounting information: done

Notice the "Discarding device blocks: failed - Input/output error" line, I
bet that it is mkfs.ext4 trying to trim all blocks prior to writing the
filesystem, but it gets an I/O error while doing so.  I haven't verified it
is the same problem, but it it isn't, simply mount the resulting filesystem
and run fstrim on it:

 $ mount -o discard /dev/nbd0 /tmp/foo
 $ fstrim /tmp/foo
 fstrim: /tmp/foo: FITRIM ioctl failed: Input/output error

Quentin
Paolo Bonzini May 10, 2016, 5:55 p.m. UTC | #15
On 10/05/2016 17:38, Alex Bligh wrote:
> > and are at the
> > mercy of however the kernel currently decides to split large requests).
> 
> I am surprised TRIM doesn't get broken up the same way READ and WRITE
> do.

The payload of TRIM has constant size, so it makes sense to do that.
The kernel then self-imposes a 2GB limit in blkdev_issue_discard.

On the other hand, READ and WRITE of size n can possibly require O(n)
memory.

Paolo
Micha? Belczyk May 10, 2016, 7:13 p.m. UTC | #16
On Tue, May 10, 2016 at 04:41:27PM +0100, Alex Bligh wrote:
>
>On 10 May 2016, at 16:29, Eric Blake <eblake@redhat.com> wrote:
>
>> So the kernel is currently one of the clients that does NOT honor block
>> sizes, and as such, servers should be prepared for ANY size up to
>> UINT_MAX (other than DoS handling).
>
>Interesting followup question:
>
>If the kernel does not fragment TRIM requests at all (in the
>same way it fragments read and write requests), I suspect
>something bad may happen with TRIM requests over 2^31
>in size (particularly over 2^32 in size), as the length
>field in nbd only has 32 bits.
>
>Whether it supports block size constraints or not, it is
>going to need to do *some* breaking up of requests.

Doesn't the kernel break TRIM requests at max_discard_sectors?

I remember testing the following change in the nbd kmod long time ago:

#if 0
		disk->queue->limits.max_discard_sectors = UINT_MAX;
#else
		disk->queue->limits.max_discard_sectors = 65536;
#endif

The problem with large TRIM requests is exactly the same as with other 
(READ/WRITE) large requests -- they _may_ take loads of time and if the client 
wants to support a fast switch over to another replica it must put some 
constraints on the timeout value... which may be very easily broken if you 
allow things like a 1GB trim request on the server using DIO on the other side 
(and say heavily fragmented sparse file over XFS, never mind).

I don't think it's the problem of QEMU NBD server to fix that, the constraint 
on the server side is perfectly fine, the problem is to note that a change on 
the client side is required to limit the maximum size of the TRIM requests.
The reason for the patch I pasted above was that at the time I looked into it 
there was no other way to change the TRIM request size via 
/sys/block/$dev/queue/max_discard_sectors, but maybe the more recent kernels 
allow that?
Not to mention that the NBD client option to set that at NBD queue setup time 
would be nice...

Just my 2p.
Eric Blake May 10, 2016, 8:24 p.m. UTC | #17
On 05/10/2016 10:33 AM, Quentin Casasnovas wrote:

> Looks like there's an easier way:
> 
>  $ qemu-img create -f qcow2 foo.qcow2 10G
>  $ qemu-nbd --discard=on -c /dev/nbd0 foo.qcow2
>  $ mkfs.ext4 /dev/nbd0
>  mke2fs 1.42.13 (17-May-2015)
>  Discarding device blocks: failed - Input/output error

strace on mkfs.ext4 shows:
...
open("/dev/nbd1", O_RDWR|O_EXCL)        = 3
stat("/dev/nbd1", {st_mode=S_IFBLK|0660, st_rdev=makedev(43, 1), ...}) = 0
ioctl(3, BLKDISCARDZEROES, [0])         = 0
ioctl(3, BLKROGET, [0])                 = 0
uname({sysname="Linux", nodename="red", ...}) = 0
ioctl(3, BLKDISCARD, [0, 1000000])      = 0
write(1, "Discarding device blocks: ", 26) = 26
write(1, "   4096/2621440", 15   4096/2621440)         = 15
write(1, "\10\10\10\10\10\10\10\10\10\10\10\1) = 15
ioctl(3, BLKDISCARD, [1000000, 80000000]) = -1 EIO (Input/output error)
write(1, "               ", 15               )         = 15
write(1, "\10\10\10\10\10\10\10\10\10\10\10\1) = 15
write(1, "failed - ", 9failed - )                = 9

so it does indeed look like the smaller request [0, 0x1000000] succeeded
while the larger [0x1000000, 0x80000000] failed with EIO.  But I
instrumented my qemu-nbd server to output all of the incoming requests
it was receiving from the kernel, and I never saw a mention of
NBD_CMD_TRIM being received.  Maybe it's dependent on what version of
the kernel and/or NBD kernel module you are running?

I also tried fallocate(1) on /dev/nbd1, as in:
 # strace fallocate -o 0 -l 64k -p /dev/nbd1
but it looks like the kernel doesn't yet wire up fallocate(2) to forward
to the appropriate device ioctls and/or NBD_CMD_TRIM, where strace says:

open("/dev/nbd1", O_RDWR)               = 3
fallocate(3, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 0, 65536) = -1
ENODEV (No such device)

The ENODEV is a rather unusual choice of error.
Eric Blake May 10, 2016, 8:34 p.m. UTC | #18
On 05/06/2016 02:45 AM, Quentin Casasnovas wrote:
> When running fstrim on a filesystem mounted through qemu-nbd with
> --discard=on, fstrim would fail with I/O errors:
> 
>   $ fstrim /k/spl/ice/
>   fstrim: /k/spl/ice/: FITRIM ioctl failed: Input/output error
> 
> and qemu-nbd was spitting these:
> 
>   nbd.c:nbd_co_receive_request():L1232: len (94621696) is larger than max len (33554432)

Your patch duplicates what is already present in qemu:

commit eb38c3b67018ff8069e4f674a28661931a8a3e4f
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Thu Jan 7 14:32:42 2016 +0100

    nbd-server: do not check request length except for reads and writes

    Only reads and writes need to allocate memory correspondent to the
    request length.  Other requests can be sent to the storage without
    allocating any memory, and thus any request length is acceptable.

    Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
    Cc: qemu-block@nongnu.org
    Reviewed-by: Max Reitz <mreitz@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

For the purposes of qemu-stable, it's better to backport the existing
patch than to write a new version of it.

It also helps to state what version of qemu you were testing, as it is
obviously not the (soon-to-be-released) version 2.6 which already has
the fix.

>  nbd.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/nbd.c b/nbd.c
> index b3d9654..e733669 100644
> --- a/nbd.c
> +++ b/nbd.c
> @@ -1209,6 +1209,11 @@ static ssize_t nbd_co_send_reply(NBDRequest *req, struct nbd_reply *reply,
>      return rc;
>  }
>  
> +static bool nbd_should_check_request_size(const struct nbd_request *request)
> +{
> +        return (request->type & NBD_CMD_MASK_COMMAND) != NBD_CMD_TRIM;
> +}
> +
>  static ssize_t nbd_co_receive_request(NBDRequest *req, struct nbd_request *request)
>  {
>      NBDClient *client = req->client;
> @@ -1227,7 +1232,8 @@ static ssize_t nbd_co_receive_request(NBDRequest *req, struct nbd_request *reque
>          goto out;
>      }
>  
> -    if (request->len > NBD_MAX_BUFFER_SIZE) {
> +    if (nbd_should_check_request_size(request) &&
> +        request->len > NBD_MAX_BUFFER_SIZE) {
>          LOG("len (%u) is larger than max len (%u)",
>              request->len, NBD_MAX_BUFFER_SIZE);
>          rc = -EINVAL;
>
Quentin Casasnovas May 11, 2016, 8:34 a.m. UTC | #19
On Tue, May 10, 2016 at 02:34:18PM -0600, Eric Blake wrote:
> On 05/06/2016 02:45 AM, Quentin Casasnovas wrote:
> > When running fstrim on a filesystem mounted through qemu-nbd with
> > --discard=on, fstrim would fail with I/O errors:
> > 
> >   $ fstrim /k/spl/ice/
> >   fstrim: /k/spl/ice/: FITRIM ioctl failed: Input/output error
> > 
> > and qemu-nbd was spitting these:
> > 
> >   nbd.c:nbd_co_receive_request():L1232: len (94621696) is larger than max len (33554432)
> 
> Your patch duplicates what is already present in qemu:
> 
> commit eb38c3b67018ff8069e4f674a28661931a8a3e4f
> Author: Paolo Bonzini <pbonzini@redhat.com>
> Date:   Thu Jan 7 14:32:42 2016 +0100
> 
>     nbd-server: do not check request length except for reads and writes
> 
>     Only reads and writes need to allocate memory correspondent to the
>     request length.  Other requests can be sent to the storage without
>     allocating any memory, and thus any request length is acceptable.
> 
>     Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
>     Cc: qemu-block@nongnu.org
>     Reviewed-by: Max Reitz <mreitz@redhat.com>
>     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> For the purposes of qemu-stable, it's better to backport the existing
> patch than to write a new version of it.
> 

Ha sorry I missed this! I wouldn't have tried to debug/fix myself otherwise :)

> It also helps to state what version of qemu you were testing, as it is
> obviously not the (soon-to-be-released) version 2.6 which already has
> the fix.
>

I was using qemu-2.5.0-rc3 on Gentoo but this was also verified on some
Debian systems which appears to be on 1.1.2+dfsg-6a+deb7u12 and on Ubuntu
Xenial 2.0.0+dfsg-2ubuntu1.22.

I wrote the patch on top of https://github.com/bonzini/qemu.git:master
(a7e00e2) which didn't contain the fix last Friday.

Anyway, cool if the fix is going into mainline :)

Quentin
Paolo Bonzini May 11, 2016, 9:38 a.m. UTC | #20
On 10/05/2016 18:23, Alex Bligh wrote:
> > Is there a use case where you'd want to split up a single big TRIM request
> > in smaller ones (as in some hardware would not support it or something)?
> > Even then, it looks like this splitting up would be hardware dependant and
> > better implemented in block device drivers.
> 
> Part of the point of the block size extension is to push such limits to the
> client.
> 
> I could make up use cases (e.g. that performing a multi-gigabyte trim in
> a single threaded server will effectively block all other I/O), but the
> main reason in my book is orthogonality, and the fact the client needs
> to do some breaking up anyway.

That's why SCSI for example has a limit to UNMAP and WRITE SAME requests
(actually it has three limits: number of ranges per unmap, which in NBD
and in SCSI WRITE SAME is 1; number of blocks per unmap descriptor;
number of blocks per WRITE SAME operation).  These limits however are a
different one than the read/write limit, because you want to support
systems where TRIM is much faster than regular I/O (and possibly even
O(1) if trimming something that is already trimmed).

Paolo
Eric Blake May 11, 2016, 2:08 p.m. UTC | #21
On 05/11/2016 03:38 AM, Paolo Bonzini wrote:
> 
> 
> On 10/05/2016 18:23, Alex Bligh wrote:
>>> Is there a use case where you'd want to split up a single big TRIM request
>>> in smaller ones (as in some hardware would not support it or something)?
>>> Even then, it looks like this splitting up would be hardware dependant and
>>> better implemented in block device drivers.
>>
>> Part of the point of the block size extension is to push such limits to the
>> client.
>>
>> I could make up use cases (e.g. that performing a multi-gigabyte trim in
>> a single threaded server will effectively block all other I/O), but the
>> main reason in my book is orthogonality, and the fact the client needs
>> to do some breaking up anyway.
> 
> That's why SCSI for example has a limit to UNMAP and WRITE SAME requests
> (actually it has three limits: number of ranges per unmap, which in NBD
> and in SCSI WRITE SAME is 1; number of blocks per unmap descriptor;
> number of blocks per WRITE SAME operation).  These limits however are a
> different one than the read/write limit, because you want to support
> systems where TRIM is much faster than regular I/O (and possibly even
> O(1) if trimming something that is already trimmed).

Then I think I will propose a doc patch to the extension-info branch
that adds new INFO items for NBD_INFO_TRIM_SIZE and
NBD_INFO_WRITE_ZERO_SIZE (if requested by client and replied by server,
then this can be larger than the normal maximum size in
NBD_INFO_BLOCK_SIZE; if either side doesn't request the info, then
assume any maximum in NBD_INFO_BLOCK_SIZE applies, otherwise UINT32_MAX;
and the two infos are separate items because NBD_FLAG_SEND_TRIM and
NBD_FLAG_SEND_WRITE_ZEROES are orthogonally optional).
Eric Blake May 11, 2016, 2:11 p.m. UTC | #22
On 05/11/2016 02:34 AM, Quentin Casasnovas wrote:
>> Your patch duplicates what is already present in qemu:
>>
>> commit eb38c3b67018ff8069e4f674a28661931a8a3e4f

> Ha sorry I missed this! I wouldn't have tried to debug/fix myself otherwise :)

It's okay - you at least sparked a good conversation of what the NBD
protocol itself needs to think about in relation to these issues.

> 
>> It also helps to state what version of qemu you were testing, as it is
>> obviously not the (soon-to-be-released) version 2.6 which already has
>> the fix.
>>
> 
> I was using qemu-2.5.0-rc3 on Gentoo but this was also verified on some
> Debian systems which appears to be on 1.1.2+dfsg-6a+deb7u12 and on Ubuntu
> Xenial 2.0.0+dfsg-2ubuntu1.22.
> 
> I wrote the patch on top of https://github.com/bonzini/qemu.git:master
> (a7e00e2) which didn't contain the fix last Friday.

The master branch in Paolo's tree doesn't always actively track upstream
(mainly when he is about to post a pull request that needs to be
rebased).  Best is to base patches against true upstream:
git://git.qemu-project.org/qemu.git

> 
> Anyway, cool if the fix is going into mainline :)

Already in there for the 2.6 release candidates (and final 2.6 should be
later today).  And this thread argues that the stable 2.5.x branch
should indeed backport the identified patch.
Alex Bligh May 11, 2016, 2:55 p.m. UTC | #23
On 11 May 2016, at 15:08, Eric Blake <eblake@redhat.com> wrote:

> Then I think I will propose a doc patch to the extension-info branch
> that adds new INFO items for NBD_INFO_TRIM_SIZE and
> NBD_INFO_WRITE_ZERO_SIZE (if requested by client and replied by server,
> then this can be larger than the normal maximum size in
> NBD_INFO_BLOCK_SIZE; if either side doesn't request the info, then
> assume any maximum in NBD_INFO_BLOCK_SIZE applies, otherwise UINT32_MAX;
> and the two infos are separate items because NBD_FLAG_SEND_TRIM and
> NBD_FLAG_SEND_WRITE_ZEROES are orthogonally optional).

mmmm ...

at this rate you'd be better with a list of command / maximum size
tuples!

--
Alex Bligh
Paolo Bonzini May 11, 2016, 3:08 p.m. UTC | #24
On 11/05/2016 16:55, Alex Bligh wrote:
>> > Then I think I will propose a doc patch to the extension-info branch
>> > that adds new INFO items for NBD_INFO_TRIM_SIZE and
>> > NBD_INFO_WRITE_ZERO_SIZE (if requested by client and replied by server,
>> > then this can be larger than the normal maximum size in
>> > NBD_INFO_BLOCK_SIZE; if either side doesn't request the info, then
>> > assume any maximum in NBD_INFO_BLOCK_SIZE applies, otherwise UINT32_MAX;
>> > and the two infos are separate items because NBD_FLAG_SEND_TRIM and
>> > NBD_FLAG_SEND_WRITE_ZEROES are orthogonally optional).
> mmmm ...
> 
> at this rate you'd be better with a list of command / maximum size
> tuples!

Not sure if serious or not, but anyway I think enumerating the items
individually should be enough.  It's not like commands grow on trees. :)

Paolo
Wouter Verhelst May 11, 2016, 9:06 p.m. UTC | #25
On Tue, May 10, 2016 at 04:08:50PM +0100, Alex Bligh wrote:
> What surprises me is that a release kernel is using experimental
> NBD extensions; there's no guarantee these won't change. Or does
> fstrim work some other way?

What makes you say NBD_CMD_TRIM is experimental? It's been implemented for
years.
Wouter Verhelst May 11, 2016, 9:10 p.m. UTC | #26
On Tue, May 10, 2016 at 09:29:23AM -0600, Eric Blake wrote:
> On 05/10/2016 09:08 AM, Alex Bligh wrote:
> > Eric,
> > 
> >> Hmm. The current wording of the experimental block size additions does
> >> NOT allow the client to send a NBD_CMD_TRIM with a size larger than the
> >> maximum NBD_CMD_WRITE:
> >> https://github.com/yoe/nbd/blob/extension-info/doc/proto.md#block-size-constraints
> > 
> > Correct
> > 
> >> Maybe we should revisit that in the spec, and/or advertise yet another
> >> block size (since the maximum size for a trim and/or write_zeroes
> >> request may indeed be different than the maximum size for a read/write).
> > 
> > I think it's up to the server to either handle large requests, or
> > for the client to break these up.
> 
> But the question at hand here is whether we should permit servers to
> advertise multiple maximum block sizes (one for read/write, another one
> for trim/write_zero, or even two [at least qemu tracks a separate
> maximum trim vs. write_zero sizing in its generic block layer]), or
> merely stick with the current wording that requires clients that honor
> maximum block size to obey the same maximum for ALL commands, regardless
> of amount of data sent over the wire.
> 
> > 
> > The core problem here is that the kernel (and, ahem, most servers) are
> > ignorant of the block size extension, and need to guess how to break
> > things up. In my view the client (kernel in this case) should
> > be breaking the trim requests up into whatever size it uses as the
> > maximum size write requests. But then it would have to know about block
> > sizes which are in (another) experimental extension.
> 
> Correct - no one has yet patched the kernel to honor block sizes
> advertised through what is currently an experimental extension.  (We
> have ioctl(NBD_SET_BLKSIZE) which can be argued to set the kernel's
> minimum block size, but I haven't audited whether the kernel actually
> guarantees that all client requests are sent aligned to the value passed
> that way - but we have nothing to set the maximum size, and are at the
> mercy of however the kernel currently decides to split large requests).

I don't actually think it does that at all, tbh. There is an
"integrityhuge" test in the reference server test suite which performs a
number of large requests (up to 50M), and which was created by a script
that just does direct read requests to /dev/nbdX.

It just so happens that most upper layers (filesystems etc) don't make
requests larger than about 32MiB, but that's not related.

> So the kernel is currently one of the clients that does NOT honor block
> sizes, and as such, servers should be prepared for ANY size up to
> UINT_MAX (other than DoS handling).  My question above only applies to
> clients that use the experimental block size extensions.

Right.

[...]
Wouter Verhelst May 11, 2016, 9:12 p.m. UTC | #27
On Tue, May 10, 2016 at 04:38:29PM +0100, Alex Bligh wrote:
> On 10 May 2016, at 16:29, Eric Blake <eblake@redhat.com> wrote:
> > So the kernel is currently one of the clients that does NOT honor block
> > sizes, and as such, servers should be prepared for ANY size up to
> > UINT_MAX (other than DoS handling).
> 
> Or not to permit a connection.

Right -- and this is why I was recommending against making this a MUST in the
first place.

[...]
Alex Bligh May 12, 2016, 3:03 p.m. UTC | #28
On 11 May 2016, at 22:06, Wouter Verhelst <w@uter.be> wrote:

> On Tue, May 10, 2016 at 04:08:50PM +0100, Alex Bligh wrote:
>> What surprises me is that a release kernel is using experimental
>> NBD extensions; there's no guarantee these won't change. Or does
>> fstrim work some other way?
> 
> What makes you say NBD_CMD_TRIM is experimental? It's been implemented for
> years.

Me confusing it with NBD_CMD_WRITE_ZEROES which I corrected later
in the thread.
Alex Bligh May 12, 2016, 3:33 p.m. UTC | #29
On 11 May 2016, at 22:12, Wouter Verhelst <w@uter.be> wrote:

> On Tue, May 10, 2016 at 04:38:29PM +0100, Alex Bligh wrote:
>> On 10 May 2016, at 16:29, Eric Blake <eblake@redhat.com> wrote:
>>> So the kernel is currently one of the clients that does NOT honor block
>>> sizes, and as such, servers should be prepared for ANY size up to
>>> UINT_MAX (other than DoS handling).
>> 
>> Or not to permit a connection.
> 
> Right -- and this is why I was recommending against making this a MUST in the
> first place.

Indeed, and it currently is a 'MAY':

> except that if a server believes a client's behaviour constitutes
> a denial of service, it MAY initiate a hard disconnect.
diff mbox

Patch

diff --git a/nbd.c b/nbd.c
index b3d9654..e733669 100644
--- a/nbd.c
+++ b/nbd.c
@@ -1209,6 +1209,11 @@  static ssize_t nbd_co_send_reply(NBDRequest *req, struct nbd_reply *reply,
     return rc;
 }
 
+static bool nbd_should_check_request_size(const struct nbd_request *request)
+{
+        return (request->type & NBD_CMD_MASK_COMMAND) != NBD_CMD_TRIM;
+}
+
 static ssize_t nbd_co_receive_request(NBDRequest *req, struct nbd_request *request)
 {
     NBDClient *client = req->client;
@@ -1227,7 +1232,8 @@  static ssize_t nbd_co_receive_request(NBDRequest *req, struct nbd_request *reque
         goto out;
     }
 
-    if (request->len > NBD_MAX_BUFFER_SIZE) {
+    if (nbd_should_check_request_size(request) &&
+        request->len > NBD_MAX_BUFFER_SIZE) {
         LOG("len (%u) is larger than max len (%u)",
             request->len, NBD_MAX_BUFFER_SIZE);
         rc = -EINVAL;