[1/4] libceph: don't bail early from try_read() when skipping a message
diff mbox

Message ID 1455986729-12544-2-git-send-email-idryomov@gmail.com
State New
Headers show

Commit Message

Ilya Dryomov Feb. 20, 2016, 4:45 p.m. UTC
The contract between try_read() and try_write() is that when called
each processes as much data as possible.  When instructed by osd_client
to skip a message, try_read() is violating this contract by returning
after receiving and discarding a single message instead of checking for
more.  try_write() then gets a chance to write out more requests,
generating more replies/skips for try_read() to handle, forcing the
messenger into a starvation loop.

Cc: stable@vger.kernel.org # 3.10+
Reported-by: Varada Kari <Varada.Kari@sandisk.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Varada Kari <Varada.Kari@sandisk.com>
---
 net/ceph/messenger.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Alex Elder Feb. 20, 2016, 6:28 p.m. UTC | #1
On 02/20/2016 10:45 AM, Ilya Dryomov wrote:
> The contract between try_read() and try_write() is that when called
> each processes as much data as possible.  When instructed by osd_client
> to skip a message, try_read() is violating this contract by returning
> after receiving and discarding a single message instead of checking for
> more.  try_write() then gets a chance to write out more requests,
> generating more replies/skips for try_read() to handle, forcing the
> messenger into a starvation loop.

This makes sense.  If (for example) two messages are pending
(either arriving or already arrived and buffered), and the
first is going to be discarded, we should still process the
second one to the extent possible.


I'm going to review the way this stuff works. (I always have
to do that anyway it seems, so I document it in the process...)


There are two places when read_partial_message() will decide
to discard a message.  One is if a completely valid message
header arrives, but the sequence number is wrong (this means
re-sent message has arrived).  The second place the messenger
discards an incoming message is if the connection's message
buffer allocation method indicates the message should be
skipped.  The MDS msg_alloc method never calls for a buffer
to be skipped.  But the mon client and the OSD client do.

The mon client and OSD client issue requests to their
servers which are expected to receive response messages.
When a mon or OSD request is set up, a buffer to hold its
corresponding response message is preallocated.  So when
receiving a message on a mon or OSD connection, the message
allocation function used by the messenger looks up and
returns that preallocated buffer.  If a response message
arrives from a mon server or OSD was not expected (i.e.,
no outstanding request is waiting for that response) the
incoming response message should be skipped.

Additionally, an OSD client expects to receive only a few
types of messages.  It peeks at an incoming message to
determine how it should be handled.  If an incoming message
is not of a type recognized by the OSD client, the message
should be skipped.  Even if an incoming message is an OSD
response that was expected, it is possible that the buffer
allocated to receive the response is too small for the
incoming message; and in that case the incoming message
should be skipped.

So...  There are two places in try_read_message() where
incoming messages should be discarded, and they represent:
receipt of a duplicate (re-sent) message; receipt of an
unexpected message type; receipt of a response message
that was not expected; or receipt of a response that was
too large for the buffer that had been allocated to hold it.


A stream of data coming in a connection consists of blocks
of data, each of which begins with a "tag" that indicates
what is contained within the block that follows. A
connection's "in_tag" indicates what we're expecting to
see next on input, and if it's READY it means we've
finished reading a previous block (and so the next
byte should be a tag).

If a connection should discard some incoming data, this is
represented by having its "in_base_pos" value be negative.
The value of "in_base_pos" in this case is negative the
number of bytes to be discarded.  Whenever a negative
value is recorded in "in_base_pos", the connection's
"in_tag" field is set to READY, indicating that immediately
following the discarded bytes is expected to be a tag.

When a connection is in OPEN state, the first thing
try_read() does is discard the incoming message data if
it is supposed to be skipped.  Once this is done, it looks
at "in_tag" to determine what to do next, and if it's READY
it sets "in_tag" to the next byte on input, and prepares
for receiving the block of data that follows accordingly.


OK, now coming back to the issue at hand.  In try_read(),
if the current "in_tag" is MSG, a message is expected on
input, and read_partial_message() is called.  If that
returns 0 (or negative), try_read() completes and returns
to its caller.  Otherwise, if the connection's "in_tag"
is READY, try_read() jumps back to the top again, to
continue reading the next block of data on the input.

Currently, if read_partial_message() ever finds that the
incoming message should be skipped, it indicates that
by setting the negated number of bytes to be skipped in
the connection's "in_base_pos" field, and setting its
"in_tag" to READY.  HOWEVER, read_partial_message()
currently returns 0 in this case, which causes its
caller, try_read(), to quit reading.


This fix changes the two places in read_partial_message()
that arrange to skip the current incoming message so they
return 1 rather than 0.  The result is that when control
returns to try_read(), it will return to its top.  That
will cause the incoming message to be discarded, *and*
will cause the next block of incoming data to be processed.

Without this change, try_write() will get called again
before try_read() gets a chance to discard the current
message and read the next one.


In other words...

Looks good.

Reviewed-by: Alex Elder <elder@linaro.org>

> Cc: stable@vger.kernel.org # 3.10+
> Reported-by: Varada Kari <Varada.Kari@sandisk.com>
> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
> Tested-by: Varada Kari <Varada.Kari@sandisk.com>
> ---
>  net/ceph/messenger.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index 9cfedf565f5b..fec20819a5ea 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -2337,7 +2337,7 @@ static int read_partial_message(struct ceph_connection *con)
>  		con->in_base_pos = -front_len - middle_len - data_len -
>  			sizeof(m->footer);
>  		con->in_tag = CEPH_MSGR_TAG_READY;
> -		return 0;
> +		return 1;
>  	} else if ((s64)seq - (s64)con->in_seq > 1) {
>  		pr_err("read_partial_message bad seq %lld expected %lld\n",
>  		       seq, con->in_seq + 1);
> @@ -2363,7 +2363,7 @@ static int read_partial_message(struct ceph_connection *con)
>  				sizeof(m->footer);
>  			con->in_tag = CEPH_MSGR_TAG_READY;
>  			con->in_seq++;
> -			return 0;
> +			return 1;
>  		}
>  
>  		BUG_ON(!con->in_msg);
> k

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ilya Dryomov Feb. 20, 2016, 8:21 p.m. UTC | #2
On Sat, Feb 20, 2016 at 7:28 PM, Alex Elder <elder@ieee.org> wrote:
> On 02/20/2016 10:45 AM, Ilya Dryomov wrote:
>> The contract between try_read() and try_write() is that when called
>> each processes as much data as possible.  When instructed by osd_client
>> to skip a message, try_read() is violating this contract by returning
>> after receiving and discarding a single message instead of checking for
>> more.  try_write() then gets a chance to write out more requests,
>> generating more replies/skips for try_read() to handle, forcing the
>> messenger into a starvation loop.
>
> This makes sense.  If (for example) two messages are pending
> (either arriving or already arrived and buffered), and the
> first is going to be discarded, we should still process the
> second one to the extent possible.
>
>
> I'm going to review the way this stuff works. (I always have
> to do that anyway it seems, so I document it in the process...)
>
>
> There are two places when read_partial_message() will decide
> to discard a message.  One is if a completely valid message
> header arrives, but the sequence number is wrong (this means
> re-sent message has arrived).  The second place the messenger
> discards an incoming message is if the connection's message
> buffer allocation method indicates the message should be
> skipped.  The MDS msg_alloc method never calls for a buffer
> to be skipped.  But the mon client and the OSD client do.
>
> The mon client and OSD client issue requests to their
> servers which are expected to receive response messages.
> When a mon or OSD request is set up, a buffer to hold its
> corresponding response message is preallocated.  So when
> receiving a message on a mon or OSD connection, the message
> allocation function used by the messenger looks up and
> returns that preallocated buffer.  If a response message
> arrives from a mon server or OSD was not expected (i.e.,
> no outstanding request is waiting for that response) the
> incoming response message should be skipped.
>
> Additionally, an OSD client expects to receive only a few
> types of messages.  It peeks at an incoming message to
> determine how it should be handled.  If an incoming message
> is not of a type recognized by the OSD client, the message
> should be skipped.  Even if an incoming message is an OSD
> response that was expected, it is possible that the buffer
> allocated to receive the response is too small for the
> incoming message; and in that case the incoming message
> should be skipped.
>
>
> So...  There are two places in try_read_message() where
> incoming messages should be discarded, and they represent:
> receipt of a duplicate (re-sent) message; receipt of an
> unexpected message type; receipt of a response message
> that was not expected; or receipt of a response that was
> too large for the buffer that had been allocated to hold it.

Just a note: it's less about the preallocated buffer being too small
part (in theory we could allocate everything in alloc_msg(), although
that would be stupid) and more about the "no outstanding request is
waiting for that response" part.  Stray and/or duplicate replies from
the OSDs are expected in various cases.

>
>
> A stream of data coming in a connection consists of blocks
> of data, each of which begins with a "tag" that indicates
> what is contained within the block that follows. A
> connection's "in_tag" indicates what we're expecting to
> see next on input, and if it's READY it means we've
> finished reading a previous block (and so the next
> byte should be a tag).
>
> If a connection should discard some incoming data, this is
> represented by having its "in_base_pos" value be negative.
> The value of "in_base_pos" in this case is negative the
> number of bytes to be discarded.  Whenever a negative
> value is recorded in "in_base_pos", the connection's
> "in_tag" field is set to READY, indicating that immediately
> following the discarded bytes is expected to be a tag.
>
> When a connection is in OPEN state, the first thing
> try_read() does is discard the incoming message data if
> it is supposed to be skipped.  Once this is done, it looks
> at "in_tag" to determine what to do next, and if it's READY
> it sets "in_tag" to the next byte on input, and prepares
> for receiving the block of data that follows accordingly.
>
>
> OK, now coming back to the issue at hand.  In try_read(),
> if the current "in_tag" is MSG, a message is expected on
> input, and read_partial_message() is called.  If that
> returns 0 (or negative), try_read() completes and returns
> to its caller.  Otherwise, if the connection's "in_tag"
> is READY, try_read() jumps back to the top again, to
> continue reading the next block of data on the input.
>
> Currently, if read_partial_message() ever finds that the
> incoming message should be skipped, it indicates that
> by setting the negated number of bytes to be skipped in
> the connection's "in_base_pos" field, and setting its
> "in_tag" to READY.  HOWEVER, read_partial_message()
> currently returns 0 in this case, which causes its
> caller, try_read(), to quit reading.
>
>
> This fix changes the two places in read_partial_message()
> that arrange to skip the current incoming message so they
> return 1 rather than 0.  The result is that when control
> returns to try_read(), it will return to its top.  That
> will cause the incoming message to be discarded, *and*
> will cause the next block of incoming data to be processed.
>
> Without this change, try_write() will get called again
> before try_read() gets a chance to discard the current
> message and read the next one.
>
>
> In other words...
>
> Looks good.
>
> Reviewed-by: Alex Elder <elder@linaro.org>

Thanks for the review, Alex!  These walkthroughs of yours on the list
have been extremely helpful at times ;)

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Elder Feb. 20, 2016, 8:38 p.m. UTC | #3
On 02/20/2016 02:21 PM, Ilya Dryomov wrote:
>> > So...  There are two places in try_read_message() where
>> > incoming messages should be discarded, and they represent:
>> > receipt of a duplicate (re-sent) message; receipt of an
>> > unexpected message type; receipt of a response message
>> > that was not expected; or receipt of a response that was
>> > too large for the buffer that had been allocated to hold it.
> Just a note: it's less about the preallocated buffer being too small
> part (in theory we could allocate everything in alloc_msg(), although
> that would be stupid) and more about the "no outstanding request is
> waiting for that response" part.  Stray and/or duplicate replies from
> the OSDs are expected in various cases.

Agreed.  That one case (buffer too large) is anomalous.  But
the code *does* say "skip the message" in that case, which
is why I mentioned it.

Like I said before, when it's been a while since I've reviewed
code I need to go immerse myself in it for a little while.  And
while I'm at it, I figure I might as well explain it all for the
benefit of other people.  This is especially true for these tiny
patches--sometimes the smaller the patch, the bigger the context
needed to understand why it fixes a problem.

					-Alex

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 9cfedf565f5b..fec20819a5ea 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -2337,7 +2337,7 @@  static int read_partial_message(struct ceph_connection *con)
 		con->in_base_pos = -front_len - middle_len - data_len -
 			sizeof(m->footer);
 		con->in_tag = CEPH_MSGR_TAG_READY;
-		return 0;
+		return 1;
 	} else if ((s64)seq - (s64)con->in_seq > 1) {
 		pr_err("read_partial_message bad seq %lld expected %lld\n",
 		       seq, con->in_seq + 1);
@@ -2363,7 +2363,7 @@  static int read_partial_message(struct ceph_connection *con)
 				sizeof(m->footer);
 			con->in_tag = CEPH_MSGR_TAG_READY;
 			con->in_seq++;
-			return 0;
+			return 1;
 		}
 
 		BUG_ON(!con->in_msg);