diff mbox

[Nbd,RESEND,0/5] nbd improvements

Message ID 20160915131741.cth6kilmcgnobbuu@grep.be (mailing list archive)
State New, archived
Headers show

Commit Message

Wouter Verhelst Sept. 15, 2016, 1:17 p.m. UTC
On Thu, Sep 15, 2016 at 01:44:29PM +0100, Alex Bligh wrote:
> 
> > On 15 Sep 2016, at 13:41, Christoph Hellwig <hch@infradead.org> wrote:
> > 
> > On Thu, Sep 15, 2016 at 01:39:11PM +0100, Alex Bligh wrote:
> >> That's probably right in the case of file-based back ends that
> >> are running on a Linux OS. But gonbdserver for instance supports
> >> (e.g.) Ceph based backends, where each connection might be talking
> >> to a completely separate ceph node, and there may be no cache
> >> consistency between connections.
> > 
> > Yes, if you don't have a cache coherent backend you are generally
> > screwed with a multiqueue protocol.
> 
> I wonder if the ability to support multiqueue should be visible
> in the negotiation stage. That would allow the client to refuse
> to select multiqueue where it isn't safe.

The server can always refuse to allow multiple connections.

I was thinking of changing the spec as follows:


The latter bit (on the client side) is because even if your backend has
no cache coherency issues, TCP does not guarantee ordering between
multiple connections. I don't know if the above is in line with what
blk-mq does, but consider the following scenario:

- A client sends two writes to the server, followed (immediately) by a
  flush, where at least the second write and the flush are not sent over
  the same connection.
- The first write is a small one, and it is handled almost immediately.
- The second write takes a little longer, so the flush is handled
  earlier than the second write
- The network packet containing the flush reply gets lost for whatever
  reason, so the client doesn't get it, and we fall into TCP
  retransmits.
- The second write finishes, and its reply header does not get lost
- After the second write reply reaches the client, the TCP retransmits
  for the flush reply are handled.

In the above scenario, the flush reply arrives on the client side after
a write reply which it did not cover; so the client will (incorrectly)
assume that the write has reached permanent storage when in fact it may
not have done so yet.

If the kernel does not care about the ordering of the two writes versus
the flush, then there is no problem. I don't know how blk-mq works in
that context, but if the above is a likely scenario, we may have to
reconsider adding blk-mq to nbd.

Comments

Josef Bacik Sept. 15, 2016, 1:57 p.m. UTC | #1
On 09/15/2016 09:17 AM, Wouter Verhelst wrote:
> On Thu, Sep 15, 2016 at 01:44:29PM +0100, Alex Bligh wrote:
>>
>>> On 15 Sep 2016, at 13:41, Christoph Hellwig <hch@infradead.org> wrote:
>>>
>>> On Thu, Sep 15, 2016 at 01:39:11PM +0100, Alex Bligh wrote:
>>>> That's probably right in the case of file-based back ends that
>>>> are running on a Linux OS. But gonbdserver for instance supports
>>>> (e.g.) Ceph based backends, where each connection might be talking
>>>> to a completely separate ceph node, and there may be no cache
>>>> consistency between connections.
>>>
>>> Yes, if you don't have a cache coherent backend you are generally
>>> screwed with a multiqueue protocol.
>>
>> I wonder if the ability to support multiqueue should be visible
>> in the negotiation stage. That would allow the client to refuse
>> to select multiqueue where it isn't safe.
>
> The server can always refuse to allow multiple connections.
>
> I was thinking of changing the spec as follows:
>
> diff --git a/doc/proto.md b/doc/proto.md
> index 217f57e..cb099e2 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -308,6 +308,23 @@ specification, the
>  [kernel documentation](https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt)
>  may be useful.
>
> +For performance reasons, clients MAY open multiple connections to the
> +same server. To support such clients, servers SHOULD ensure that at
> +least one of the following conditions hold:
> +
> +* Flush commands are processed for ALL connections. That is, when an
> +  `NBD_CMD_WRITE` is processed on one connection, and then an
> +  `NBD_CMD_FLUSH` is processed on another connection, the data of the
> +  `NBD_CMD_WRITE` on the first connection MUST reach permanent storage
> +  before the reply of the `NBD_CMD_FLUSH` is sent.
> +* The server allows `NBD_CMD_WRITE` and `NBD_CMD_FLUSH` on at most one
> +  connection
> +* Multiple connections are not allowed
> +
> +In addition, clients using multiple connections SHOULD NOT send
> +`NBD_CMD_FLUSH` if an `NBD_CMD_WRITE` for which they care in relation to
> +the flush has not been replied to yet.
> +
>  #### Request message
>
>  The request message, sent by the client, looks as follows:
>
> The latter bit (on the client side) is because even if your backend has
> no cache coherency issues, TCP does not guarantee ordering between
> multiple connections. I don't know if the above is in line with what
> blk-mq does, but consider the following scenario:
>
> - A client sends two writes to the server, followed (immediately) by a
>   flush, where at least the second write and the flush are not sent over
>   the same connection.
> - The first write is a small one, and it is handled almost immediately.
> - The second write takes a little longer, so the flush is handled
>   earlier than the second write
> - The network packet containing the flush reply gets lost for whatever
>   reason, so the client doesn't get it, and we fall into TCP
>   retransmits.
> - The second write finishes, and its reply header does not get lost
> - After the second write reply reaches the client, the TCP retransmits
>   for the flush reply are handled.
>
> In the above scenario, the flush reply arrives on the client side after
> a write reply which it did not cover; so the client will (incorrectly)
> assume that the write has reached permanent storage when in fact it may
> not have done so yet.

This isn't an NBD problem, this is an application problem.  The application must 
wait for all writes it cares about _before_ issuing a flush.  This is the same 
as for normal storage as it is for NBD.  It is not NBD's responsibility to 
maintain coherency between multiple requests across connections, just simply to 
act on and respond to requests.

I think changing the specification to indicate that this is the case for 
multiple connections is a good thing, to keep NBD servers from doing weird 
things like sending different connections to the same export to different 
backing stores without some sort of synchronization.  It should definitely be 
explicitly stated somewhere that NBD does not provide any ordering guarantees 
and that is up to the application.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Bligh Sept. 15, 2016, 3:17 p.m. UTC | #2
Josef,

> On 15 Sep 2016, at 14:57, Josef Bacik <jbacik@fb.com> wrote:
> 
> This isn't an NBD problem, this is an application problem.  The application must wait for all writes it cares about _before_ issuing a flush.  This is the same as for normal storage as it is for NBD.  It is not NBD's responsibility to maintain coherency between multiple requests across connections, just simply to act on and respond to requests.
> 
> I think changing the specification to indicate that this is the case for multiple connections is a good thing, to keep NBD servers from doing weird things like sending different connections to the same export to different backing stores without some sort of synchronization.  It should definitely be explicitly stated somewhere that NBD does not provide any ordering guarantees and that is up to the application.  Thanks,

I don't think that's correct.

The block stack issues a flush to mean (precisely) "do not reply to this until all preceding writes THAT HAVE BEEN REPLIED TO have been persisted to non-volatile storage".

The danger is with multiple connections (where apparently only one flush is sent - let's say down connection 1) that not al the writes that have been replied to on connection 2 have been persisted to non-volatile storage. Only the ones on connection 1 have been persisted (this is assuming the nbd server doesn't 'link' in some way the connections).

There's nothing the 'application' (here meaning the kernel or higher level) can do to mitigate this. Sure it can wait for all the replies, but this doesn't guarantee the writes have been persisted to non-volatile storage, precisely because writes may return prior to this.
Alex Bligh Sept. 15, 2016, 4:08 p.m. UTC | #3
Wouter,

> The server can always refuse to allow multiple connections.

Sure, but it would be neater to warn the client of that
at negotiation stage (it would only be one flag, e.g.
'multiple connections unsafe'). That way the connection
won't fail with a cryptic EBUSY or whatever, but will
just negotiate a single connection.

> I was thinking of changing the spec as follows:
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index 217f57e..cb099e2 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -308,6 +308,23 @@ specification, the
> [kernel documentation](https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt)
> may be useful.
> 
> +For performance reasons, clients MAY open multiple connections to the
> +same server. To support such clients, servers SHOULD ensure that at
> +least one of the following conditions hold:
> +
> +* Flush commands are processed for ALL connections. That is, when an
> +  `NBD_CMD_WRITE` is processed on one connection, and then an
> +  `NBD_CMD_FLUSH` is processed on another connection, the data of the
> +  `NBD_CMD_WRITE` on the first connection MUST reach permanent storage
> +  before the reply of the `NBD_CMD_FLUSH` is sent.
> +* The server allows `NBD_CMD_WRITE` and `NBD_CMD_FLUSH` on at most one
> +  connection
> +* Multiple connections are not allowed
> +
> +In addition, clients using multiple connections SHOULD NOT send
> +`NBD_CMD_FLUSH` if an `NBD_CMD_WRITE` for which they care in relation to
> +the flush has not been replied to yet.
> +

I don't think that should be a mandatory behaviour. For once, it would
be reasonably easy on gonbdserver but a PITA on the reference server.
You'd need to put in IPC between each of the forked processes OR rely
on fdatasync() only - I'm not sure that would necessarily work
safely with (e.g.) the 'treefiles' / COW options.

I think better would be to say that the server MUST either

* Not support NBD_CMD_FLUSH at all
* Support NBD_CMD_FLUSH across channels (as you set out above), or
* Indicate that it does not support multiple channels.

> #### Request message
> 
> The request message, sent by the client, looks as follows:
> 
> The latter bit (on the client side) is because even if your backend has
> no cache coherency issues, TCP does not guarantee ordering between
> multiple connections. I don't know if the above is in line with what
> blk-mq does, but consider the following scenario:
> 
> - A client sends two writes to the server, followed (immediately) by a
>  flush, where at least the second write and the flush are not sent over
>  the same connection.
> - The first write is a small one, and it is handled almost immediately.
> - The second write takes a little longer, so the flush is handled
>  earlier than the second write
> - The network packet containing the flush reply gets lost for whatever
>  reason, so the client doesn't get it, and we fall into TCP
>  retransmits.
> - The second write finishes, and its reply header does not get lost
> - After the second write reply reaches the client, the TCP retransmits
>  for the flush reply are handled.
> 
> In the above scenario, the flush reply arrives on the client side after
> a write reply which it did not cover; so the client will (incorrectly)
> assume that the write has reached permanent storage when in fact it may
> not have done so yet.
> 
> If the kernel does not care about the ordering of the two writes versus
> the flush, then there is no problem. I don't know how blk-mq works in
> that context, but if the above is a likely scenario, we may have to
> reconsider adding blk-mq to nbd.

Actually I think this is a problem anyway. A simpler failure case is
one where (by chance) one channel gets the writes, and one channel
gets the flushes. The flush reply is delayed beyond the replies to
unconnected writes (on the other channel) and hence the kernel thinks
replied-to writes have been persisted when they have not been.

The only way to fix that (as far as I can see) without changing flush
semantics is for the block layer to issue flush requests on each
channel when flushing on one channel. This, incidentally, would
resolve every other issue!
Wouter Verhelst Sept. 15, 2016, 4:27 p.m. UTC | #4
On Thu, Sep 15, 2016 at 05:08:21PM +0100, Alex Bligh wrote:
> Wouter,
> 
> > The server can always refuse to allow multiple connections.
> 
> Sure, but it would be neater to warn the client of that at negotiation
> stage (it would only be one flag, e.g.  'multiple connections
> unsafe').

I suppose that's not a bad idea.

[...]
> > I was thinking of changing the spec as follows:
> > 
> > diff --git a/doc/proto.md b/doc/proto.md
> > index 217f57e..cb099e2 100644
> > --- a/doc/proto.md
> > +++ b/doc/proto.md
> > @@ -308,6 +308,23 @@ specification, the
> > [kernel documentation](https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt)
> > may be useful.
> > 
> > +For performance reasons, clients MAY open multiple connections to the
> > +same server. To support such clients, servers SHOULD ensure that at
> > +least one of the following conditions hold:
> > +
> > +* Flush commands are processed for ALL connections. That is, when an
> > +  `NBD_CMD_WRITE` is processed on one connection, and then an
> > +  `NBD_CMD_FLUSH` is processed on another connection, the data of the
> > +  `NBD_CMD_WRITE` on the first connection MUST reach permanent storage
> > +  before the reply of the `NBD_CMD_FLUSH` is sent.
> > +* The server allows `NBD_CMD_WRITE` and `NBD_CMD_FLUSH` on at most one
> > +  connection
> > +* Multiple connections are not allowed
> > +
> > +In addition, clients using multiple connections SHOULD NOT send
> > +`NBD_CMD_FLUSH` if an `NBD_CMD_WRITE` for which they care in relation to
> > +the flush has not been replied to yet.
> > +
> 
> I don't think that should be a mandatory behaviour.

Which part of it?

> For once, it would
> be reasonably easy on gonbdserver but a PITA on the reference server.
> You'd need to put in IPC between each of the forked processes OR rely
> on fdatasync() only - I'm not sure that would necessarily work
> safely with (e.g.) the 'treefiles' / COW options.
> 
> I think better would be to say that the server MUST either
> 
> * Not support NBD_CMD_FLUSH at all

I think we should discourage not supporting FLUSH, rather than
suggesting it. 

> * Support NBD_CMD_FLUSH across channels (as you set out above), or
> * Indicate that it does not support multiple channels.

You dropped the one with no writes. I said "at most" there for a reason.
Originally I was going to say "if the server is read-only", but then
thought that it could work to do the "at most" thing. After having given
that some more thought, I now realize that if you write, you need to
flush across to other channels, regardless of whether they write too, so
that bit of it is moot now anyway.

Still, a server which exports read-only should still be safe for
multiple connections, even if there is no cache coherency (since
presumably nothing's going to change anyway).

[...]
> > The latter bit (on the client side) is because even if your backend has
> > no cache coherency issues, TCP does not guarantee ordering between
> > multiple connections. I don't know if the above is in line with what
> > blk-mq does, but consider the following scenario:
> > 
> > - A client sends two writes to the server, followed (immediately) by a
> >  flush, where at least the second write and the flush are not sent over
> >  the same connection.
> > - The first write is a small one, and it is handled almost immediately.
> > - The second write takes a little longer, so the flush is handled
> >  earlier than the second write
> > - The network packet containing the flush reply gets lost for whatever
> >  reason, so the client doesn't get it, and we fall into TCP
> >  retransmits.
> > - The second write finishes, and its reply header does not get lost
> > - After the second write reply reaches the client, the TCP retransmits
> >  for the flush reply are handled.
> > 
> > In the above scenario, the flush reply arrives on the client side after
> > a write reply which it did not cover; so the client will (incorrectly)
> > assume that the write has reached permanent storage when in fact it may
> > not have done so yet.
> > 
> > If the kernel does not care about the ordering of the two writes versus
> > the flush, then there is no problem. I don't know how blk-mq works in
> > that context, but if the above is a likely scenario, we may have to
> > reconsider adding blk-mq to nbd.
> 
> Actually I think this is a problem anyway. A simpler failure case is
> one where (by chance) one channel gets the writes, and one channel
> gets the flushes. The flush reply is delayed beyond the replies to
> unconnected writes (on the other channel) and hence the kernel thinks
> replied-to writes have been persisted when they have not been.

Yes, that is another example of essentially the same problem.

> The only way to fix that (as far as I can see) without changing flush
> semantics is for the block layer to issue flush requests on each
> channel when flushing on one channel.

Christoph just said that that doesn't (currently) happen; I don't know
whether the kernel currently already (optimistically) sends out flush
requests before the writes that it expects to hit permanent storage have
finished, but if it doesn't do that, then there is no problem and my
suggested bit of spec would be okay.

If there are good reasons to do so, however, we do indeed have a problem
and something else is necessary. I don't think flushing across all
connections is the best solution, though.
Alex Bligh Sept. 15, 2016, 4:42 p.m. UTC | #5
Wouter,

> On 15 Sep 2016, at 17:27, Wouter Verhelst <w@uter.be> wrote:
> 
> On Thu, Sep 15, 2016 at 05:08:21PM +0100, Alex Bligh wrote:
>> Wouter,
>> 
>>> The server can always refuse to allow multiple connections.
>> 
>> Sure, but it would be neater to warn the client of that at negotiation
>> stage (it would only be one flag, e.g.  'multiple connections
>> unsafe').
> 
> I suppose that's not a bad idea.

Good.

> [...]
>>> I was thinking of changing the spec as follows:
>>> 
>>> diff --git a/doc/proto.md b/doc/proto.md
>>> index 217f57e..cb099e2 100644
>>> --- a/doc/proto.md
>>> +++ b/doc/proto.md
>>> @@ -308,6 +308,23 @@ specification, the
>>> [kernel documentation](https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt)
>>> may be useful.
>>> 
>>> +For performance reasons, clients MAY open multiple connections to the
>>> +same server. To support such clients, servers SHOULD ensure that at
>>> +least one of the following conditions hold:
>>> +
>>> +* Flush commands are processed for ALL connections. That is, when an
>>> +  `NBD_CMD_WRITE` is processed on one connection, and then an
>>> +  `NBD_CMD_FLUSH` is processed on another connection, the data of the
>>> +  `NBD_CMD_WRITE` on the first connection MUST reach permanent storage
>>> +  before the reply of the `NBD_CMD_FLUSH` is sent.
>>> +* The server allows `NBD_CMD_WRITE` and `NBD_CMD_FLUSH` on at most one
>>> +  connection
>>> +* Multiple connections are not allowed
>>> +
>>> +In addition, clients using multiple connections SHOULD NOT send
>>> +`NBD_CMD_FLUSH` if an `NBD_CMD_WRITE` for which they care in relation to
>>> +the flush has not been replied to yet.
>>> +
>> 
>> I don't think that should be a mandatory behaviour.
> 
> Which part of it?

I've read it again :-) The wording was slightly contorted. I think
what I mean is that if you don't support flush at all, that's
another option.

The final paragraph I am not sure is right, as that's not what the kernel
currently does. If we are going to suggest a change in our main client's
behaviour, should we not just request that flush is done on all channels?

>> For once, it would
>> be reasonably easy on gonbdserver but a PITA on the reference server.
>> You'd need to put in IPC between each of the forked processes OR rely
>> on fdatasync() only - I'm not sure that would necessarily work
>> safely with (e.g.) the 'treefiles' / COW options.
>> 
>> I think better would be to say that the server MUST either
>> 
>> * Not support NBD_CMD_FLUSH at all
> 
> I think we should discourage not supporting FLUSH, rather than
> suggesting it. 

Sure, but some backends just don't support flush. For them, this
aspect at least is not a problem.

>> * Support NBD_CMD_FLUSH across channels (as you set out above), or
>> * Indicate that it does not support multiple channels.
> 
> You dropped the one with no writes. I said "at most" there for a reason.
> Originally I was going to say "if the server is read-only", but then
> thought that it could work to do the "at most" thing. After having given
> that some more thought, I now realize that if you write, you need to
> flush across to other channels, regardless of whether they write too, so
> that bit of it is moot now anyway.
> 
> Still, a server which exports read-only should still be safe for
> multiple connections, even if there is no cache coherency (since
> presumably nothing's going to change anyway).

Yes

>> Actually I think this is a problem anyway. A simpler failure case is
>> one where (by chance) one channel gets the writes, and one channel
>> gets the flushes. The flush reply is delayed beyond the replies to
>> unconnected writes (on the other channel) and hence the kernel thinks
>> replied-to writes have been persisted when they have not been.
> 
> Yes, that is another example of essentially the same problem.

Yeah, I was just trying to simplify things.

>> The only way to fix that (as far as I can see) without changing flush
>> semantics is for the block layer to issue flush requests on each
>> channel when flushing on one channel.
> 
> Christoph just said that that doesn't (currently) happen; I don't know
> whether the kernel currently already (optimistically) sends out flush
> requests before the writes that it expects to hit permanent storage have
> finished, but if it doesn't do that, then there is no problem and my
> suggested bit of spec would be okay.
> 
> If there are good reasons to do so, however, we do indeed have a problem
> and something else is necessary. I don't think flushing across all
> connections is the best solution, though.

Well, the way I look at it is that we have a proposed change in
client behaviour (multiple channels) which causes problems at
least with flush and also (I think) with cache coherency (see other
email). We should either not make that change, or ensure other changes
are added which mitigate these issues.

Flush is actually the obvious one. Cache coherency is far more
subtle (though possibly fixable by something in the spec that
states that if multiple connections are supported, cache must
be coherent between them).
Eric Blake Sept. 15, 2016, 7:02 p.m. UTC | #6
On 09/15/2016 11:27 AM, Wouter Verhelst wrote:
> On Thu, Sep 15, 2016 at 05:08:21PM +0100, Alex Bligh wrote:
>> Wouter,
>>
>>> The server can always refuse to allow multiple connections.
>>
>> Sure, but it would be neater to warn the client of that at negotiation
>> stage (it would only be one flag, e.g.  'multiple connections
>> unsafe').
> 
> I suppose that's not a bad idea.

Seems like it may need to be a per-export flag, rather than a global
flag (as a given server may be able to serve multiple types of files,
where the safety depends on the type of file being served).
diff mbox

Patch

diff --git a/doc/proto.md b/doc/proto.md
index 217f57e..cb099e2 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -308,6 +308,23 @@  specification, the
 [kernel documentation](https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt)
 may be useful.
 
+For performance reasons, clients MAY open multiple connections to the
+same server. To support such clients, servers SHOULD ensure that at
+least one of the following conditions hold:
+
+* Flush commands are processed for ALL connections. That is, when an
+  `NBD_CMD_WRITE` is processed on one connection, and then an
+  `NBD_CMD_FLUSH` is processed on another connection, the data of the
+  `NBD_CMD_WRITE` on the first connection MUST reach permanent storage
+  before the reply of the `NBD_CMD_FLUSH` is sent.
+* The server allows `NBD_CMD_WRITE` and `NBD_CMD_FLUSH` on at most one
+  connection
+* Multiple connections are not allowed
+
+In addition, clients using multiple connections SHOULD NOT send
+`NBD_CMD_FLUSH` if an `NBD_CMD_WRITE` for which they care in relation to
+the flush has not been replied to yet.
+
 #### Request message
 
 The request message, sent by the client, looks as follows: