
[net,1/4] Revert "net/smc: don't wait for send buffer space when data was already sent"

Message ID 20211027085208.16048-2-tonylu@linux.alibaba.com (mailing list archive)
State Superseded
Series Fixes for SMC

Commit Message

Tony Lu Oct. 27, 2021, 8:52 a.m. UTC
From: Tony Lu <tony.ly@linux.alibaba.com>

This reverts commit 6889b36da78a21a312d8b462c1fa25a03c2ff192.

When using SMC to replace TCP, some userspace applications like netperf
don't correctly check the return code of the send syscall, which indicates
how many bytes were sent. If the rc of send() is smaller than expected, the
application should try to send again instead of exiting directly. It is
difficult to change this incorrect behavior in userspace applications, so
revert the commit.

Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Ursula Braun <ubraun@linux.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Reported-by: Jacob Qi <jacob.qi@linux.alibaba.com>
Signed-off-by: Tony Lu <tony.ly@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
---
 net/smc/smc_tx.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Comments

Karsten Graul Oct. 27, 2021, 10:21 a.m. UTC | #1
On 27/10/2021 10:52, Tony Lu wrote:
> From: Tony Lu <tony.ly@linux.alibaba.com>
> 
> This reverts commit 6889b36da78a21a312d8b462c1fa25a03c2ff192.
> 
> When using SMC to replace TCP, some userspace applications like netperf
> don't correctly check the return code of the send syscall, which indicates
> how many bytes were sent. If the rc of send() is smaller than expected, the
> application should try to send again instead of exiting directly. It is
> difficult to change this incorrect behavior in userspace applications, so
> revert the commit.

Your change would restore the old behavior of handling all sockets as if they
were blocking sockets, trying forever to send the provided data bytes.
This is not how it should work.

We encountered the same issue with netperf, but this is the only 'broken'
application that we know of so far which does not implement the socket API
correctly.
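
For illustration, a minimal sketch of the retry loop such an application
would need around send() on a blocking stream socket (the helper name
send_all is made up here; error handling beyond EINTR is elided):

    #include <errno.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* Keep calling send() until all of buf is out. Even on a blocking
     * socket, send() may return fewer bytes than requested, so a short
     * send must be treated as progress, not as an error. */
    static ssize_t send_all(int fd, const char *buf, size_t len)
    {
            size_t off = 0;

            while (off < len) {
                    ssize_t n = send(fd, buf + off, len - off, 0);

                    if (n < 0) {
                            if (errno == EINTR)
                                    continue;  /* interrupted, retry */
                            return -1;         /* real error */
                    }
                    off += n;  /* partial send: advance and retry */
            }
            return off;
    }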
Jakub Kicinski Oct. 27, 2021, 3:08 p.m. UTC | #2
On Wed, 27 Oct 2021 12:21:32 +0200 Karsten Graul wrote:
> On 27/10/2021 10:52, Tony Lu wrote:
> > From: Tony Lu <tony.ly@linux.alibaba.com>
> > 
> > This reverts commit 6889b36da78a21a312d8b462c1fa25a03c2ff192.
> > 
> > When using SMC to replace TCP, some userspace applications like netperf
> > don't correctly check the return code of the send syscall, which indicates
> > how many bytes were sent. If the rc of send() is smaller than expected, the
> > application should try to send again instead of exiting directly. It is
> > difficult to change this incorrect behavior in userspace applications, so
> > revert the commit.
> 
> Your change would restore the old behavior of handling all sockets as if they
> were blocking sockets, trying forever to send the provided data bytes.
> This is not how it should work.

Isn't the application supposed to make the socket non-blocking or
pass MSG_DONTWAIT if it doesn't want to sleep? It's unclear why 
the fix was needed in the first place.
 
> We encountered the same issue with netperf, but this is the only 'broken'
> application that we know of so far which does not implement the socket API
> correctly.
Karsten Graul Oct. 27, 2021, 3:38 p.m. UTC | #3
On 27/10/2021 17:08, Jakub Kicinski wrote:
> On Wed, 27 Oct 2021 12:21:32 +0200 Karsten Graul wrote:
>> On 27/10/2021 10:52, Tony Lu wrote:
>>> From: Tony Lu <tony.ly@linux.alibaba.com>
>>>
>>> This reverts commit 6889b36da78a21a312d8b462c1fa25a03c2ff192.
>>>
>>> When using SMC to replace TCP, some userspace applications like netperf
>>> don't correctly check the return code of the send syscall, which indicates
>>> how many bytes were sent. If the rc of send() is smaller than expected, the
>>> application should try to send again instead of exiting directly. It is
>>> difficult to change this incorrect behavior in userspace applications, so
>>> revert the commit.
>>
>> Your change would restore the old behavior of handling all sockets as if they
>> were blocking sockets, trying forever to send the provided data bytes.
>> This is not how it should work.
> 
> Isn't the application supposed to make the socket non-blocking or
> pass MSG_DONTWAIT if it doesn't want to sleep? It's unclear why 
> the fix was needed in the first place.
  
You are right, all of this non-blocking handling is already done in smc_tx_wait().
So this fix was for blocking sockets. The commit message explains the original
intention:

    net/smc: don't wait for send buffer space when data was already sent

    When there is no more send buffer space and at least 1 byte was already
    sent then return to user space. The wait is only done when no data was
    sent by the sendmsg() call.
    This fixes smc_tx_sendmsg() which tried to always send all user data and
    started to wait for free send buffer space when needed. During this wait
    the user space program was blocked in the sendmsg() call and hence not
    able to receive incoming data. When both sides were in such a situation
    then the connection stalled forever.

What we found out was that applications called sendmsg() with large data
buffers using blocking sockets. This led to the described situation, where
the solution was to return early to user space even if not all data was
sent yet. Userspace applications should not have a problem with the fact
that sendmsg() returns a smaller byte count than requested.

Reverting this patch would bring back the stalled connection problem.

>> We encountered the same issue with netperf, but this is the only 'broken'
>> application that we know of so far which does not implement the socket API
>> correctly.
>
Jakub Kicinski Oct. 27, 2021, 3:47 p.m. UTC | #4
On Wed, 27 Oct 2021 17:38:27 +0200 Karsten Graul wrote:
> What we found out was that applications called sendmsg() with large data
> buffers using blocking sockets. This led to the described situation, where
> the solution was to return early to user space even if not all data was
> sent yet. Userspace applications should not have a problem with the fact
> that sendmsg() returns a smaller byte count than requested.
> 
> Reverting this patch would bring back the stalled connection problem.

I'm not sure. The man page for send says:

       When the message does not fit into the send buffer of the socket,
       send() normally blocks, unless the socket has been placed in
       nonblocking I/O mode.  In nonblocking mode it would fail with the
       error EAGAIN or EWOULDBLOCK in this case.

dunno if that's required by POSIX or just a best practice.
Tony Lu Oct. 28, 2021, 6:48 a.m. UTC | #5
On Wed, Oct 27, 2021 at 08:47:10AM -0700, Jakub Kicinski wrote:
> On Wed, 27 Oct 2021 17:38:27 +0200 Karsten Graul wrote:
> > What we found out was that applications called sendmsg() with large data
> > buffers using blocking sockets. This led to the described situation, where
> > the solution was to return early to user space even if not all data was
> > sent yet. Userspace applications should not have a problem with the fact
> > that sendmsg() returns a smaller byte count than requested.
> > 
> > Reverting this patch would bring back the stalled connection problem.
> 
> I'm not sure. The man page for send says:
> 
>        When the message does not fit into the send buffer of the socket,
>        send() normally blocks, unless the socket has been placed in
>        nonblocking I/O mode.  In nonblocking mode it would fail with the
>        error EAGAIN or EWOULDBLOCK in this case.
> 
> dunno if that's required by POSIX or just a best practice.

The man page describes the common behavior of the socket API, but the
actual behavior depends on the implementation.

For example, the connect(2) implementation for TCP would never block this
way, but connect(2) also provides EAGAIN errors for UNIX domain sockets:

       EAGAIN For nonblocking UNIX domain sockets, the socket is
              nonblocking, and the connection cannot be completed
              immediately.  For other socket families, there are
              insufficient entries in the routing cache.

In my opinion, if we are going to replace TCP with SMC, the userspace
socket API should behave the same and not break userspace applications
like netperf. It would be better to block here when sending a message
without enough buffer space.

In our benchmarks and E2E tests (Redis, MySQL, etc.), it is acceptable to
block here, because userspace applications usually block in a loop until
the data is sent out. If it blocks, the scheduler can handle it.
Karsten Graul Oct. 28, 2021, 11:57 a.m. UTC | #6
On 27/10/2021 17:47, Jakub Kicinski wrote:
> On Wed, 27 Oct 2021 17:38:27 +0200 Karsten Graul wrote:
>> What we found out was that applications called sendmsg() with large data
>> buffers using blocking sockets. This led to the described situation, where
>> the solution was to return early to user space even if not all data was
>> sent yet. Userspace applications should not have a problem with the fact
>> that sendmsg() returns a smaller byte count than requested.
>>
>> Reverting this patch would bring back the stalled connection problem.
> 
> I'm not sure. The man page for send says:
> 
>        When the message does not fit into the send buffer of the socket,
>        send() normally blocks, unless the socket has been placed in
>        nonblocking I/O mode.  In nonblocking mode it would fail with the
>        error EAGAIN or EWOULDBLOCK in this case.
> 
> dunno if that's required by POSIX or just a best practice.

I see your point, and I am also not sure about how it should work in reality.

The test case where the connection stalled is that both communication peers try
to send data larger than there is space in the local send buffer plus the remote 
receive buffer. They use blocking sockets, so if the send() call is meant to send
all data as requested then both sides would hang in send() forever/until a timeout.

In our case both sides run a send/recv loop, so allowing send() to return fewer
bytes than requested resulted in a follow-on recv() call which freed up space in
the buffers, and processing continued.

There is also some discussion about this topic in this SO thread
    https://stackoverflow.com/questions/19697218/can-send-on-a-tcp-socket-return-0-and-length
which points out that this (send() returning a smaller length) can already
happen, e.g. when the call is interrupted.

So how to deal with all of this? Is it an accepted programming error when a user space 
program gets itself into this kind of situation? Since this problem depends on 
internal send/recv buffer sizes such a program might work on one system but not on 
other systems.

In the end the question might be whether such a 'deadlock' is acceptable,
or whether it is okay to have send() return fewer bytes than requested.
Jakub Kicinski Oct. 28, 2021, 2:38 p.m. UTC | #7
On Thu, 28 Oct 2021 13:57:55 +0200 Karsten Graul wrote:
> So how to deal with all of this? Is it an accepted programming error
> when a user space program gets itself into this kind of situation?
> Since this problem depends on internal send/recv buffer sizes such a
> program might work on one system but not on other systems.

It's a gray area so unless someone else has a strong opinion we can
leave it as is.

> In the end the question might be whether such a 'deadlock' is acceptable,
> or whether it is okay to have send() return fewer bytes than requested.

Yeah.. the thing is we have better APIs for applications to ask not to
block than we do for applications to block. If someone really wants to
wait for all data to come out for performance reasons they will
struggle to get that behavior. 

We also have the small yet pernicious case where the buffer is
completely full at sendmsg() time, IOW we didn't send a single byte.
We won't be able to return "partial" results and deadlock. IDK if your
application can hit this, but it should really use non-blocking send if
it doesn't want blocking behavior..
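
For completeness, a rough sketch of that non-blocking pattern (an assumed
helper, not from this thread): MSG_DONTWAIT makes send() fail with
EAGAIN/EWOULDBLOCK instead of sleeping, and the application decides
where to block, e.g. in poll():

    #include <errno.h>
    #include <poll.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* Try to send without sleeping in the kernel; on a full send
     * buffer, wait explicitly via poll() and let the caller retry. */
    static ssize_t send_nonblocking(int fd, const char *buf, size_t len)
    {
            ssize_t n = send(fd, buf, len, MSG_DONTWAIT);

            if (n >= 0)
                    return n;  /* possibly a partial send */
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                    struct pollfd pfd = { .fd = fd, .events = POLLOUT };

                    poll(&pfd, 1, -1);  /* block here, not in send() */
                    return 0;           /* caller retries from its offset */
            }
            return -1;  /* real error */
    }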
Tony Lu Nov. 1, 2021, 7:04 a.m. UTC | #8
On Thu, Oct 28, 2021 at 07:38:27AM -0700, Jakub Kicinski wrote:
> On Thu, 28 Oct 2021 13:57:55 +0200 Karsten Graul wrote:
> > So how to deal with all of this? Is it an accepted programming error
> > when a user space program gets itself into this kind of situation?
> > Since this problem depends on internal send/recv buffer sizes such a
> > program might work on one system but not on other systems.
> 
> It's a gray area so unless someone else has a strong opinion we can
> leave it as is.

Things might be different. IMHO, the key point of this problem is whether to
implement the "standard" POSIX socket API or a TCP-socket-compatible API.

> > In the end the question might be whether such a 'deadlock' is acceptable,
> > or whether it is okay to have send() return fewer bytes than requested.
> 
> Yeah.. the thing is we have better APIs for applications to ask not to
> block than we do for applications to block. If someone really wants to
> wait for all data to come out for performance reasons they will
> struggle to get that behavior. 

IMO, it is better to do something to unify this behavior. Some
applications like netperf would be broken, and people who want to use
SMC to run basic benchmarks would be confused about this and about its
compatibility with TCP. Maybe we could:
1) correct the behavior of netperf to check the rc as we discussed, or
2) "copy" the behavior of TCP and try to stay compatible with TCP, though
it is a gray area.


Cheers,
Tony Lu

> We also have the small yet pernicious case where the buffer is
> completely full at sendmsg() time, IOW we didn't send a single byte.
> We won't be able to return "partial" results and deadlock. IDK if your
> application can hit this, but it should really use non-blocking send if
> it doesn't want blocking behavior..
Karsten Graul Nov. 2, 2021, 9:17 a.m. UTC | #9
On 01/11/2021 08:04, Tony Lu wrote:
> On Thu, Oct 28, 2021 at 07:38:27AM -0700, Jakub Kicinski wrote:
>> On Thu, 28 Oct 2021 13:57:55 +0200 Karsten Graul wrote:
>>> So how to deal with all of this? Is it an accepted programming error
>>> when a user space program gets itself into this kind of situation?
>>> Since this problem depends on internal send/recv buffer sizes such a
>>> program might work on one system but not on other systems.
>>
>> It's a gray area so unless someone else has a strong opinion we can
>> leave it as is.
> 
> Things might be different. IMHO, the key point of this problem is whether to
> implement the "standard" POSIX socket API or a TCP-socket-compatible API.
> 
>>> In the end the question might be whether such a 'deadlock' is acceptable,
>>> or whether it is okay to have send() return fewer bytes than requested.
>>
>> Yeah.. the thing is we have better APIs for applications to ask not to
>> block than we do for applications to block. If someone really wants to
>> wait for all data to come out for performance reasons they will
>> struggle to get that behavior. 
> 
> IMO, it is better to do something to unify this behavior. Some
> applications like netperf would be broken, and people who want to use
> SMC to run basic benchmarks would be confused about this and about its
> compatibility with TCP. Maybe we could:
> 1) correct the behavior of netperf to check the rc as we discussed, or
> 2) "copy" the behavior of TCP and try to stay compatible with TCP, though
> it is a gray area.

I have a strong opinion here: when the question is whether the user either
encounters a deadlock or send() returns fewer bytes than requested,
I prefer the latter behavior.
The second case is much easier for users to debug: they can do something
to handle the problem (loop around send()), and this case can even be detected
using strace. But the deadlock case is nearly impossible for users to debug and
there is nothing to prevent it when the workload pattern runs into this situation
(except to not use blocking sends).
Tony Lu Nov. 3, 2021, 3:06 a.m. UTC | #10
On Tue, Nov 02, 2021 at 10:17:15AM +0100, Karsten Graul wrote:
> On 01/11/2021 08:04, Tony Lu wrote:
> > On Thu, Oct 28, 2021 at 07:38:27AM -0700, Jakub Kicinski wrote:
> >> On Thu, 28 Oct 2021 13:57:55 +0200 Karsten Graul wrote:
> >>> So how to deal with all of this? Is it an accepted programming error
> >>> when a user space program gets itself into this kind of situation?
> >>> Since this problem depends on internal send/recv buffer sizes such a
> >>> program might work on one system but not on other systems.
> >>
> >> It's a gray area so unless someone else has a strong opinion we can
> >> leave it as is.
> > 
> > Things might be different. IMHO, the key point of this problem is whether to
> > implement the "standard" POSIX socket API or a TCP-socket-compatible API.
> > 
> >>> In the end the question might be whether such a 'deadlock' is acceptable,
> >>> or whether it is okay to have send() return fewer bytes than requested.
> >>
> >> Yeah.. the thing is we have better APIs for applications to ask not to
> >> block than we do for applications to block. If someone really wants to
> >> wait for all data to come out for performance reasons they will
> >> struggle to get that behavior. 
> > 
> > IMO, it is better to do something to unify this behavior. Some
> > applications like netperf would be broken, and people who want to use
> > SMC to run basic benchmarks would be confused about this and about its
> > compatibility with TCP. Maybe we could:
> > 1) correct the behavior of netperf to check the rc as we discussed, or
> > 2) "copy" the behavior of TCP and try to stay compatible with TCP, though
> > it is a gray area.
> 
> I have a strong opinion here: when the question is whether the user either
> encounters a deadlock or send() returns fewer bytes than requested,
> I prefer the latter behavior.
> The second case is much easier for users to debug: they can do something
> to handle the problem (loop around send()), and this case can even be detected
> using strace. But the deadlock case is nearly impossible for users to debug and
> there is nothing to prevent it when the workload pattern runs into this situation
> (except to not use blocking sends).

I agree with you. I am curious about this deadlock scenario. If it is
convenient, could you provide a reproducible test case? We are also
setting up an SMC CI/CD system to find compatibility and performance
fallback problems. Maybe we could do something to make it better.

Cheers,
Tony Lu
Karsten Graul Nov. 6, 2021, 12:46 p.m. UTC | #11
On 03/11/2021 04:06, Tony Lu wrote:
> 
> I agree with you. I am curious about this deadlock scenario. If it is
> convenient, could you provide a reproducible test case? We are also
> setting up an SMC CI/CD system to find compatibility and performance
> fallback problems. Maybe we could do something to make it better.

Run an echo server that uses blocking sockets. First call recv() with a 1MB buffer
and when recv() returns then send all received bytes back to the client, no matter
how many bytes were received. Run this in a loop (recv / send).
On the client side also use only blocking sockets and a send / recv loop. Use
a 1MB data buffer and call send() with the whole 1MB of data.
Now run that with both scenarios (send() returns fewer bytes than requested vs. not).
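
A compact sketch of that test case (assuming an already-connected
blocking socket fd on each side; names and setup are illustrative):

    #include <sys/types.h>
    #include <sys/socket.h>

    #define BUFSZ (1024 * 1024)  /* 1MB, larger than snd + rcv buffer space */

    static char buf[BUFSZ];

    /* Server: echo back whatever was received, however much that was. */
    static void echo_server(int fd)
    {
            for (;;) {
                    ssize_t n = recv(fd, buf, sizeof(buf), 0);

                    if (n <= 0)
                            break;
                    if (send(fd, buf, n, 0) < 0)
                            break;
            }
    }

    /* Client: blocking send/recv loop with the whole 1MB buffer.
     * If send() must transmit all 1MB before returning, both peers can
     * end up stuck in send() with full buffers (the deadlock). If
     * send() may return a smaller count, the follow-on recv() drains
     * the peer's data and both loops make progress. */
    static void client_loop(int fd)
    {
            for (;;) {
                    if (send(fd, buf, sizeof(buf), 0) < 0)
                            break;
                    if (recv(fd, buf, sizeof(buf), 0) <= 0)
                            break;
            }
    }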

Patch

diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index 738a4a99c827..d401286e9058 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -178,11 +178,12 @@  int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len)
 			conn->local_tx_ctrl.prod_flags.urg_data_pending = 1;
 
 		if (!atomic_read(&conn->sndbuf_space) || conn->urg_tx_pend) {
-			if (send_done)
-				return send_done;
 			rc = smc_tx_wait(smc, msg->msg_flags);
-			if (rc)
+			if (rc) {
+				if (send_done)
+					return send_done;
 				goto out_err;
+			}
 			continue;
 		}