Message ID | 20211027085208.16048-2-tonylu@linux.alibaba.com (mailing list archive)
---|---
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | Fixes for SMC
Context | Check | Description
---|---|---
netdev/cover_letter | success | Series has a cover letter |
netdev/fixes_present | success | Fixes tag present in non-next series |
netdev/patch_count | success | Link |
netdev/tree_selection | success | Clearly marked for net |
netdev/subject_prefix | success | Link |
netdev/cc_maintainers | success | CCed 5 of 5 maintainers |
netdev/source_inline | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Signed-off-by tag matches author and committer |
netdev/module_param | success | Was 0 now: 0 |
netdev/build_32bit | success | Errors and warnings before: 0 this patch: 0 |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/verify_fixes | success | No Fixes tag |
netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 15 lines checked |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 0 this patch: 0 |
netdev/header_inline | success | No static functions without inline keyword in header files |
On 27/10/2021 10:52, Tony Lu wrote:
> From: Tony Lu <tony.ly@linux.alibaba.com>
>
> This reverts commit 6889b36da78a21a312d8b462c1fa25a03c2ff192.
>
> When using SMC to replace TCP, some userspace applications like netperf
> don't check the return code of the send syscall correctly, i.e. how many
> bytes were actually sent. If the rc of send() is smaller than expected,
> the application should try to send again, instead of exiting directly.
> It is difficult to change this incorrect behavior in userspace
> applications, so choose to revert it.

Your change would restore the old behavior of handling all sockets like
they are blocking sockets, trying forever to send the provided data bytes.
This is not how it should work.

We encountered the same issue with netperf, but this is the only 'broken'
application that we know of so far which does not implement the socket API
correctly.
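The retry pattern Karsten expects from applications (check send()'s return
value and loop on a short write) would look roughly like the sketch below.
This is an illustration, not code from the thread; `send_all` is a
hypothetical helper name.

```c
#include <errno.h>
#include <sys/socket.h>

/* Minimal sketch: keep calling send() until every byte is written,
 * treating a short return value as "retry with the rest", not an error. */
static ssize_t send_all(int fd, const char *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t rc = send(fd, buf + done, len - done, 0);

		if (rc < 0) {
			if (errno == EINTR)
				continue;	/* interrupted: retry */
			return -1;		/* real error */
		}
		done += (size_t)rc;		/* short send: loop again */
	}
	return (ssize_t)done;
}
```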
On Wed, 27 Oct 2021 12:21:32 +0200 Karsten Graul wrote:
> On 27/10/2021 10:52, Tony Lu wrote:
> > From: Tony Lu <tony.ly@linux.alibaba.com>
> >
> > This reverts commit 6889b36da78a21a312d8b462c1fa25a03c2ff192.
> >
> > When using SMC to replace TCP, some userspace applications like netperf
> > don't check the return code of the send syscall correctly, i.e. how many
> > bytes were actually sent. If the rc of send() is smaller than expected,
> > the application should try to send again, instead of exiting directly.
> > It is difficult to change this incorrect behavior in userspace
> > applications, so choose to revert it.
>
> Your change would restore the old behavior of handling all sockets like
> they are blocking sockets, trying forever to send the provided data bytes.
> This is not how it should work.

Isn't the application supposed to make the socket non-blocking or
pass MSG_DONTWAIT if it doesn't want to sleep? It's unclear why
the fix was needed in the first place.

> We encountered the same issue with netperf, but this is the only 'broken'
> application that we know of so far which does not implement the socket API
> correctly.
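For reference, the non-blocking variant Jakub refers to can be sketched as
follows (an assumed example, not from the thread): with MSG_DONTWAIT, a full
send buffer surfaces as EAGAIN/EWOULDBLOCK instead of putting the caller to
sleep.

```c
#include <errno.h>
#include <sys/socket.h>

/* Sketch: opt out of blocking for a single call with MSG_DONTWAIT
 * (or put the whole socket in non-blocking mode with O_NONBLOCK).
 * A full send buffer then shows up as EAGAIN/EWOULDBLOCK, and the
 * caller typically waits for POLLOUT before retrying. */
static ssize_t try_send(int fd, const void *buf, size_t len)
{
	ssize_t rc = send(fd, buf, len, MSG_DONTWAIT);

	if (rc < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
		return 0;	/* no buffer space right now: retry later */
	return rc;		/* bytes queued, or -1 on a real error */
}
```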
On 27/10/2021 17:08, Jakub Kicinski wrote:
> On Wed, 27 Oct 2021 12:21:32 +0200 Karsten Graul wrote:
>> On 27/10/2021 10:52, Tony Lu wrote:
>>> From: Tony Lu <tony.ly@linux.alibaba.com>
>>>
>>> This reverts commit 6889b36da78a21a312d8b462c1fa25a03c2ff192.
>>>
>>> When using SMC to replace TCP, some userspace applications like netperf
>>> don't check the return code of the send syscall correctly, i.e. how many
>>> bytes were actually sent. If the rc of send() is smaller than expected,
>>> the application should try to send again, instead of exiting directly.
>>> It is difficult to change this incorrect behavior in userspace
>>> applications, so choose to revert it.
>>
>> Your change would restore the old behavior of handling all sockets like
>> they are blocking sockets, trying forever to send the provided data bytes.
>> This is not how it should work.
>
> Isn't the application supposed to make the socket non-blocking or
> pass MSG_DONTWAIT if it doesn't want to sleep? It's unclear why
> the fix was needed in the first place.

You are right, all of this non-blocking handling is already done in
smc_tx_wait(), so this fix was for blocking sockets. The commit message
explains the original intention:

    net/smc: don't wait for send buffer space when data was already sent

    When there is no more send buffer space and at least 1 byte was already
    sent then return to user space. The wait is only done when no data was
    sent by the sendmsg() call.
    This fixes smc_tx_sendmsg() which tried to always send all user data
    and started to wait for free send buffer space when needed. During this
    wait the user space program was blocked in the sendmsg() call and hence
    not able to receive incoming data. When both sides were in such a
    situation then the connection stalled forever.

What we found out was that applications called sendmsg() with large data
buffers using blocking sockets. This led to the described situation, where
the solution was to return early to user space even if not all data was
sent yet. Userspace applications should not have a problem with the fact
that sendmsg() returns a smaller byte count than requested.

Reverting this patch would bring back the stalled connection problem.

>> We encountered the same issue with netperf, but this is the only 'broken'
>> application that we know of so far which does not implement the socket API
>> correctly.
On Wed, 27 Oct 2021 17:38:27 +0200 Karsten Graul wrote:
> What we found out was that applications called sendmsg() with large data
> buffers using blocking sockets. This led to the described situation, where
> the solution was to return early to user space even if not all data was
> sent yet. Userspace applications should not have a problem with the fact
> that sendmsg() returns a smaller byte count than requested.
>
> Reverting this patch would bring back the stalled connection problem.

I'm not sure. The man page for send says:

    When the message does not fit into the send buffer of the socket,
    send() normally blocks, unless the socket has been placed in
    nonblocking I/O mode. In nonblocking mode it would fail with the
    error EAGAIN or EWOULDBLOCK in this case.

dunno if that's required by POSIX or just a best practice.
On Wed, Oct 27, 2021 at 08:47:10AM -0700, Jakub Kicinski wrote:
> On Wed, 27 Oct 2021 17:38:27 +0200 Karsten Graul wrote:
> > What we found out was that applications called sendmsg() with large data
> > buffers using blocking sockets. This led to the described situation, where
> > the solution was to return early to user space even if not all data was
> > sent yet. Userspace applications should not have a problem with the fact
> > that sendmsg() returns a smaller byte count than requested.
> >
> > Reverting this patch would bring back the stalled connection problem.
>
> I'm not sure. The man page for send says:
>
>     When the message does not fit into the send buffer of the socket,
>     send() normally blocks, unless the socket has been placed in
>     nonblocking I/O mode. In nonblocking mode it would fail with the
>     error EAGAIN or EWOULDBLOCK in this case.
>
> dunno if that's required by POSIX or just a best practice.

The man page describes the common cases of socket API behavior, but the
actual behavior depends on the implementation. For example, the connect(2)
implementation for TCP would never block this way, yet connect(2) also
provides EAGAIN errors for UNIX domain sockets:

    EAGAIN For nonblocking UNIX domain sockets, the socket is nonblocking,
           and the connection cannot be completed immediately. For other
           socket families, there are insufficient entries in the routing
           cache.

In my opinion, if we are going to replace TCP with SMC, these userspace
socket APIs should behave the same and not break userspace applications
like netperf. It could be better to block here when sending a message
without enough buffer space.

In our benchmarks and E2E tests (Redis, MySQL, etc.), it is acceptable to
block here, because userspace applications usually block in a loop until
the data is sent out. If it blocks, the scheduler can handle it.
On 27/10/2021 17:47, Jakub Kicinski wrote:
> On Wed, 27 Oct 2021 17:38:27 +0200 Karsten Graul wrote:
>> What we found out was that applications called sendmsg() with large data
>> buffers using blocking sockets. This led to the described situation, where
>> the solution was to return early to user space even if not all data was
>> sent yet. Userspace applications should not have a problem with the fact
>> that sendmsg() returns a smaller byte count than requested.
>>
>> Reverting this patch would bring back the stalled connection problem.
>
> I'm not sure. The man page for send says:
>
>     When the message does not fit into the send buffer of the socket,
>     send() normally blocks, unless the socket has been placed in
>     nonblocking I/O mode. In nonblocking mode it would fail with the
>     error EAGAIN or EWOULDBLOCK in this case.
>
> dunno if that's required by POSIX or just a best practice.

I see your point, and I am also not sure about how it should work in
reality.

The test case where the connection stalled is one where both communication
peers try to send data larger than there is space in the local send buffer
plus the remote receive buffer. They use blocking sockets, so if the send()
call is meant to send all data as requested, then both sides would hang in
send() forever, or until a timeout.
In our case both sides run a send/recv loop, so allowing send() to return
fewer bytes than requested resulted in a follow-on recv() call which freed
up space in the buffers, and the processing continued.

There is also some discussion about this topic in this SO thread
https://stackoverflow.com/questions/19697218/can-send-on-a-tcp-socket-return-0-and-length
which points out that this (send() returning a smaller length) may already
happen, e.g. when there is an interruption.

So how to deal with all of this? Is it an accepted programming error when a
user space program gets itself into this kind of situation? Since this
problem depends on internal send/recv buffer sizes, such a program might
work on one system but not on other systems.

In the end the question might be whether this kind of 'deadlock' is
acceptable, or whether it is okay to have send() return fewer bytes than
requested.
On Thu, 28 Oct 2021 13:57:55 +0200 Karsten Graul wrote:
> So how to deal with all of this? Is it an accepted programming error
> when a user space program gets itself into this kind of situation?
> Since this problem depends on internal send/recv buffer sizes, such a
> program might work on one system but not on other systems.

It's a gray area, so unless someone else has a strong opinion we can
leave it as is.

> In the end the question might be whether this kind of 'deadlock' is
> acceptable, or whether it is okay to have send() return fewer bytes than
> requested.

Yeah.. the thing is we have better APIs for applications to ask not to
block than we do for applications to block. If someone really wants to
wait for all data to come out for performance reasons, they will
struggle to get that behavior.

We also have the small yet pernicious case where the buffer is
completely full at sendmsg() time, IOW we didn't send a single byte.
Then we won't be able to return a "partial" result and will deadlock.
IDK if your application can hit this, but it should really use
non-blocking send if it doesn't want blocking behavior..
On Thu, Oct 28, 2021 at 07:38:27AM -0700, Jakub Kicinski wrote:
> On Thu, 28 Oct 2021 13:57:55 +0200 Karsten Graul wrote:
> > So how to deal with all of this? Is it an accepted programming error
> > when a user space program gets itself into this kind of situation?
> > Since this problem depends on internal send/recv buffer sizes, such a
> > program might work on one system but not on other systems.
>
> It's a gray area, so unless someone else has a strong opinion we can
> leave it as is.

Things might be different. IMHO, the key point of this problem is whether
to implement the "standard" POSIX socket API, or a TCP-socket compatible
API.

> > In the end the question might be whether this kind of 'deadlock' is
> > acceptable, or whether it is okay to have send() return fewer bytes than
> > requested.
>
> Yeah.. the thing is we have better APIs for applications to ask not to
> block than we do for applications to block. If someone really wants to
> wait for all data to come out for performance reasons, they will
> struggle to get that behavior.

IMO, it is better to do something to unify this behavior. Some
applications like netperf would be broken, and people who want to use
SMC to run a basic benchmark would be confused about this and about its
compatibility with TCP. Maybe we could:
1) correct the behavior of netperf to check the rc as we discussed, or
2) "copy" the behavior of TCP and try to be compatible with TCP, though
   it is a gray area.

Cheers,
Tony Lu

> We also have the small yet pernicious case where the buffer is
> completely full at sendmsg() time, IOW we didn't send a single byte.
> Then we won't be able to return a "partial" result and will deadlock.
> IDK if your application can hit this, but it should really use
> non-blocking send if it doesn't want blocking behavior..
On 01/11/2021 08:04, Tony Lu wrote:
> On Thu, Oct 28, 2021 at 07:38:27AM -0700, Jakub Kicinski wrote:
>> On Thu, 28 Oct 2021 13:57:55 +0200 Karsten Graul wrote:
>>> So how to deal with all of this? Is it an accepted programming error
>>> when a user space program gets itself into this kind of situation?
>>> Since this problem depends on internal send/recv buffer sizes, such a
>>> program might work on one system but not on other systems.
>>
>> It's a gray area, so unless someone else has a strong opinion we can
>> leave it as is.
>
> Things might be different. IMHO, the key point of this problem is whether
> to implement the "standard" POSIX socket API, or a TCP-socket compatible
> API.
>
>>> In the end the question might be whether this kind of 'deadlock' is
>>> acceptable, or whether it is okay to have send() return fewer bytes than
>>> requested.
>>
>> Yeah.. the thing is we have better APIs for applications to ask not to
>> block than we do for applications to block. If someone really wants to
>> wait for all data to come out for performance reasons, they will
>> struggle to get that behavior.
>
> IMO, it is better to do something to unify this behavior. Some
> applications like netperf would be broken, and people who want to use
> SMC to run a basic benchmark would be confused about this and about its
> compatibility with TCP. Maybe we could:
> 1) correct the behavior of netperf to check the rc as we discussed, or
> 2) "copy" the behavior of TCP and try to be compatible with TCP, though
>    it is a gray area.

I have a strong opinion here: when the question is whether the user either
encounters a deadlock or send() returns fewer bytes than requested, I
prefer the latter behavior.
The second case is much easier to debug for users: they can do something
to handle the problem (loop around send()), and this case can even be
detected using strace. But the deadlock case is nearly not debuggable by
users, and there is nothing to prevent it when the workload pattern runs
into this situation (except not using blocking sends).
On Tue, Nov 02, 2021 at 10:17:15AM +0100, Karsten Graul wrote:
> On 01/11/2021 08:04, Tony Lu wrote:
> > On Thu, Oct 28, 2021 at 07:38:27AM -0700, Jakub Kicinski wrote:
> >> On Thu, 28 Oct 2021 13:57:55 +0200 Karsten Graul wrote:
> >>> So how to deal with all of this? Is it an accepted programming error
> >>> when a user space program gets itself into this kind of situation?
> >>> Since this problem depends on internal send/recv buffer sizes, such a
> >>> program might work on one system but not on other systems.
> >>
> >> It's a gray area, so unless someone else has a strong opinion we can
> >> leave it as is.
> >
> > Things might be different. IMHO, the key point of this problem is whether
> > to implement the "standard" POSIX socket API, or a TCP-socket compatible
> > API.
> >
> >>> In the end the question might be whether this kind of 'deadlock' is
> >>> acceptable, or whether it is okay to have send() return fewer bytes than
> >>> requested.
> >>
> >> Yeah.. the thing is we have better APIs for applications to ask not to
> >> block than we do for applications to block. If someone really wants to
> >> wait for all data to come out for performance reasons, they will
> >> struggle to get that behavior.
> >
> > IMO, it is better to do something to unify this behavior. Some
> > applications like netperf would be broken, and people who want to use
> > SMC to run a basic benchmark would be confused about this and about its
> > compatibility with TCP. Maybe we could:
> > 1) correct the behavior of netperf to check the rc as we discussed, or
> > 2) "copy" the behavior of TCP and try to be compatible with TCP, though
> >    it is a gray area.
>
> I have a strong opinion here: when the question is whether the user either
> encounters a deadlock or send() returns fewer bytes than requested, I
> prefer the latter behavior.
> The second case is much easier to debug for users: they can do something
> to handle the problem (loop around send()), and this case can even be
> detected using strace. But the deadlock case is nearly not debuggable by
> users, and there is nothing to prevent it when the workload pattern runs
> into this situation (except not using blocking sends).

I agree with you. I am curious about this deadlock scenario. If it is
convenient, could you provide a reproducible test case? We are also
setting up an SMC CI/CD system to find compatibility and performance
fallback problems. Maybe we could do something to make it better.

Cheers,
Tony Lu
On 03/11/2021 04:06, Tony Lu wrote:
> I agree with you. I am curious about this deadlock scenario. If it is
> convenient, could you provide a reproducible test case? We are also
> setting up an SMC CI/CD system to find compatibility and performance
> fallback problems. Maybe we could do something to make it better.

Run an echo server that uses blocking sockets. First call recv() with a
1MB buffer, and when recv() returns, send all received bytes back to the
client, no matter how many bytes were received. Run this in a loop
(recv / send).
On the client side, also use only blocking sockets and a send / recv loop.
Use a 1MB data buffer and call send() with the whole 1MB of data.
Now run that in both scenarios (send() returns fewer bytes than requested
vs. not).
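Sketched in C, that reproducer could look roughly like the fragments below
(a hypothetical illustration of the description above; socket setup and
error handling are omitted, and both sides use blocking sockets):

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#define BUFSZ (1024 * 1024)	/* 1MB: larger than sndbuf + peer rcvbuf */
static char buf[BUFSZ];

/* Server: recv whatever arrived, echo all of it back, repeat. */
static void echo_server(int conn)
{
	for (;;) {
		ssize_t n = recv(conn, buf, sizeof(buf), 0);

		if (n <= 0)
			break;
		/* If send() insists on pushing all n bytes before returning,
		 * the server never gets back to recv() while the buffers are
		 * full, and neither does the client below: deadlock. */
		if (send(conn, buf, n, 0) < 0)
			break;
	}
}

/* Client: send the whole 1MB at once, then read the echo back. */
static void client(int conn)
{
	memset(buf, 'x', sizeof(buf));
	for (;;) {
		if (send(conn, buf, sizeof(buf), 0) < 0)
			break;
		if (recv(conn, buf, sizeof(buf), 0) <= 0)
			break;
	}
}
```

With the early-return behavior of commit 6889b36da78a, each side's send()
comes back short once the buffers fill up, the follow-on recv() drains the
peer's data, and both loops keep making progress.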
```diff
diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index 738a4a99c827..d401286e9058 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -178,11 +178,12 @@ int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len)
 			conn->local_tx_ctrl.prod_flags.urg_data_pending = 1;
 
 		if (!atomic_read(&conn->sndbuf_space) || conn->urg_tx_pend) {
-			if (send_done)
-				return send_done;
 			rc = smc_tx_wait(smc, msg->msg_flags);
-			if (rc)
+			if (rc) {
+				if (send_done)
+					return send_done;
 				goto out_err;
+			}
 			continue;
 		}
 
```