mbox series

[net-next,v4,00/18] net/smc: implement virtual ISM extension and loopback-ism

Message ID 1695568613-125057-1-git-send-email-guwen@linux.alibaba.com (mailing list archive)
Headers show
Series net/smc: implement virtual ISM extension and loopback-ism | expand

Message

Wen Gu Sept. 24, 2023, 3:16 p.m. UTC
Hi, all

# Background

SMC-D is now used in IBM z with ISM function to optimize network interconnect
for intra-CPC communications. Inspired by this, we try to make SMC-D available
on the non-s390 architecture through a software-simulated virtual ISM device,
such as loopback-ism device here, to accelerate inter-process or inter-containers
communication within the same OS.

# Design

This patch set includes 4 parts:

 - Patch #1-#3: decouple ISM device hard code from SMC-D stack.
 - Patch #4-#8: implement virtual ISM extension defined in SMCv2.1.
 - Patch #9-#13: implement loopback-ism device.
 - Patch #14-#18: memory copy optimization for the case using loopback.

The loopback-ism device is designed as a kernel device and not be limited to
a specific net namespace, ends of both inter-process connection (1/1' in diagram
below) or inter-container connection (2/2' in diagram below) will find that peer
shares the same loopback-ism device during the CLC handshake. Then loopback-ism
device will be chosen.

 Container 1 (ns1)                              Container 2 (ns2)
 +-----------------------------------------+    +-------------------------+
 | +-------+      +-------+      +-------+ |    |        +-------+        |
 | | App A |      | App B |      | App C | |    |        | App D |<-+     |
 | +-------+      +---^---+      +-------+ |    |        +-------+  |(2') |
 |     |127.0.0.1 (1')|             |192.168.0.11       192.168.0.12|     |
 |  (1)|   +--------+ | +--------+  |(2)   |    | +--------+   +--------+ |
 |     `-->|   lo   |-` |  eth0  |<-`      |    | |   lo   |   |  eth0  | |
 +---------+--|---^-+---+-----|--+---------+    +-+--------+---+-^------+-+
              |   |           |                                  |
 Kernel       |   |           |                                  |
 +----+-------v---+-----------v----------------------------------+---+----+
 |    |                            TCP                               |    |
 |    |                                                              |    |
 |    +--------------------------------------------------------------+    |
 |                                                                        |
 |                           +--------------+                             |
 |                           | smc loopback |                             |
 +---------------------------+--------------+-----------------------------+


loopback-ism device allocs RMBs and sndbufs for each connection peer and 'moves'
data from sndbuf at one end to RMB at the other end. Since communication occurs
within the same kernel, the sndbuf can be mapped to peer RMB so that the data
copy in loopback-ism case can be avoided.

 Container 1 (ns1)                              Container 2 (ns2)
 +-----------------------------------------+    +-------------------------+
 | +-------+      +-------+      +-------+ |    |        +-------+        |
 | | App A |      | App B |      | App C | |    |        | App D |        |
 | +-------+      +--^----+      +-------+ |    |        +---^---+        |
 |       |           |               |     |    |            |            |
 |   (1) |      (1') |           (2) |     |    |       (2') |            |
 |       |           |               |     |    |            |            |
 +-------|-----------|---------------|-----+    +------------|------------+
         |           |               |                       |
 Kernel  |           |               |                       |
 +-------|-----------|---------------|-----------------------|------------+
 | +-----v-+      +-------+      +---v---+               +-------+        |
 | | snd A |-+    | RMB B |<--+  | snd C |-+          +->| RMB D |        |
 | +-------+ |    +-------+   |  +-------+ |          |  +-------+        |
 | +-------+ |    +-------+   |  +-------+ |          |  +-------+        |
 | | RMB A | |    | snd B |   |  | RMB C | |          |  | snd D |        |
 | +-------+ |    +-------+   |  +-------+ |          |  +-------+        |
 |           |               +-------------v+         |                   |
 |           +-------------->| smc loopback |---------+                   |
 +---------------------------+--------------+-----------------------------+

# Benchmark Test

 * Test environments:
      - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
      - SMC sndbuf/RMB size 1MB.

 * Test object:
      - TCP: run on TCP loopback.
      - domain: run on UNIX domain.
      - SMC lo: run on SMC loopback device.

1. ipc-benchmark (see [1])

 - ./<foo> -c 1000000 -s 100

                            TCP                  SMC-lo
Message
rate (msg/s)              81539                  151251(+85.50%)

2. sockperf

 - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
 - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30

                            TCP                  SMC-lo
Bandwidth(MBps)         5313.66                 8270.51(+55.65%)
Latency(us)               5.806                   3.207(-44.76%)

3. nginx/wrk

 - serv: <smc_run> nginx
 - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80

                           TCP                   SMC-lo
Requests/s           194641.79                258656.13(+32.89%)

4. redis-benchmark

 - serv: <smc_run> redis-server
 - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024

                           TCP                   SMC-lo
GET(Requests/s)       85855.34                115640.35(+34.69%)
SET(Requests/s)       86337.15                118203.30(+36.90%)

[1] https://github.com/goldsborough/ipc-bench

v3->v4:
 - Fix build warning of patch#12 about:
   dmb_node->dma_addr = (dma_addr_t)dmb_node->cpu_addr;
 - Replace kzalloc with vzalloc to alloc DMB in loopback-ism, which
   may cause throughput or QPS to drop by 2% to 8%. see:
   https://lore.kernel.org/netdev/d6facfd5-e083-ffc7-05e5-2e8f3ef17735@linux.alibaba.com/
 - Add SMC_LO in Kconfig to turn on/off loopback-ism.
 - Make extension GID cleaner.

v2->v3:
 - Fix build warning of patch#1 and patch#10.

v1->v2:
 - Fix build error on s390 arch.

Wen Gu (18):
  net/smc: decouple ism_dev from SMC-D device dump
  net/smc: decouple ism_dev from SMC-D DMB registration
  net/smc: extract v2 check helper from SMC-D device registration
  net/smc: support SMCv2.x supplemental features negotiation
  net/smc: reserve CHID range for SMC-D virtual device
  net/smc: extend GID to 128bits only for virtual ISM device
  net/smc: disable SEID on non-s390 architecture
  net/smc: enable virtual ISM device feature bit
  net/smc: introduce SMC-D loopback device
  net/smc: implement ID-related operations of loopback
  net/smc: implement some unsupported operations of loopback
  net/smc: implement DMB-related operations of loopback
  net/smc: register loopback device as SMC-Dv2 device
  net/smc: add operation for getting DMB attribute
  net/smc: add operations for DMB attach and detach
  net/smc: avoid data copy from sndbuf to peer RMB in SMC-D
  net/smc: modify cursor update logic when sndbuf mapped to RMB
  net/smc: add interface implementation of loopback device

 drivers/s390/net/ism_drv.c    |  20 +-
 include/net/smc.h             |  32 ++-
 include/uapi/linux/smc.h      |   3 +
 include/uapi/linux/smc_diag.h |   2 +
 net/smc/Kconfig               |  13 ++
 net/smc/Makefile              |   2 +-
 net/smc/af_smc.c              |  88 ++++++--
 net/smc/smc.h                 |   7 +
 net/smc/smc_cdc.c             |  56 ++++-
 net/smc/smc_cdc.h             |   1 +
 net/smc/smc_clc.c             |  64 ++++--
 net/smc/smc_clc.h             |  10 +-
 net/smc/smc_core.c            | 111 +++++++++-
 net/smc/smc_core.h            |   9 +-
 net/smc/smc_diag.c            |  11 +-
 net/smc/smc_ism.c             | 100 ++++++---
 net/smc/smc_ism.h             |  24 ++-
 net/smc/smc_loopback.c        | 489 ++++++++++++++++++++++++++++++++++++++++++
 net/smc/smc_loopback.h        |  54 +++++
 net/smc/smc_pnet.c            |   4 +-
 20 files changed, 996 insertions(+), 104 deletions(-)
 create mode 100644 net/smc/smc_loopback.c
 create mode 100644 net/smc/smc_loopback.h

Comments

Alexandra Winter Sept. 26, 2023, 7:30 a.m. UTC | #1
On 24.09.23 17:16, Wen Gu wrote:
> This patch set includes 4 parts:
> 
>  - Patch #1-#3: decouple ISM device hard code from SMC-D stack.
>  - Patch #4-#8: implement virtual ISM extension defined in SMCv2.1.
>  - Patch #9-#13: implement loopback-ism device.
>  - Patch #14-#18: memory copy optimization for the case using loopback.


Your cover letter is very well helpful, thanks a lot for that.
I really like the way you grouped the patches.
Just a thought:
If it takes too long to get this large patchset reviewed, you could
split it up into smaller sets and get them upstream one after the other. 

I think it is especially valuable to more crisply define the interface between
SMC-D and the smc-d devices, given that virtio-ism may soon be a third kind of
smcd device.
Alexandra Winter Sept. 27, 2023, 3:16 p.m. UTC | #2
On 24.09.23 17:16, Wen Gu wrote:
> Wen Gu (18):
>   net/smc: decouple ism_dev from SMC-D device dump
>   net/smc: decouple ism_dev from SMC-D DMB registration
>   net/smc: extract v2 check helper from SMC-D device registration
>   net/smc: support SMCv2.x supplemental features negotiation
>   net/smc: reserve CHID range for SMC-D virtual device
>   net/smc: extend GID to 128bits only for virtual ISM device
>   net/smc: disable SEID on non-s390 architecture
>   net/smc: enable virtual ISM device feature bit
>   net/smc: introduce SMC-D loopback device
>   net/smc: implement ID-related operations of loopback
>   net/smc: implement some unsupported operations of loopback
>   net/smc: implement DMB-related operations of loopback
>   net/smc: register loopback device as SMC-Dv2 device
>   net/smc: add operation for getting DMB attribute
>   net/smc: add operations for DMB attach and detach
>   net/smc: avoid data copy from sndbuf to peer RMB in SMC-D
>   net/smc: modify cursor update logic when sndbuf mapped to RMB
>   net/smc: add interface implementation of loopback device
> 
>  drivers/s390/net/ism_drv.c    |  20 +-
>  include/net/smc.h             |  32 ++-
>  include/uapi/linux/smc.h      |   3 +
>  include/uapi/linux/smc_diag.h |   2 +
>  net/smc/Kconfig               |  13 ++
>  net/smc/Makefile              |   2 +-
>  net/smc/af_smc.c              |  88 ++++++--
>  net/smc/smc.h                 |   7 +
>  net/smc/smc_cdc.c             |  56 ++++-
>  net/smc/smc_cdc.h             |   1 +
>  net/smc/smc_clc.c             |  64 ++++--
>  net/smc/smc_clc.h             |  10 +-
>  net/smc/smc_core.c            | 111 +++++++++-
>  net/smc/smc_core.h            |   9 +-
>  net/smc/smc_diag.c            |  11 +-
>  net/smc/smc_ism.c             | 100 ++++++---
>  net/smc/smc_ism.h             |  24 ++-
>  net/smc/smc_loopback.c        | 489 ++++++++++++++++++++++++++++++++++++++++++
>  net/smc/smc_loopback.h        |  54 +++++
>  net/smc/smc_pnet.c            |   4 +-
>  20 files changed, 996 insertions(+), 104 deletions(-)
>  create mode 100644 net/smc/smc_loopback.c
>  create mode 100644 net/smc/smc_loopback.h


Hello Wen Gu,

I applied and built your patches and noticed some things that you may want to consider in the next version:

Series should be split up [2]

Several lines exceed 80 columns [1][3]

'git clang-format HEAD~18' finds several formatting issues.
	Maybe not all of them need to be fixed.

codespell *.patch
0006-net-smc-extend-GID-to-128bits-only-for-virtual-ISM-d.patch:7: protocal ==> protocol

With your patches applied I get some new warnings [4]:
Seems there are some ntoh conversions missing

  CHECK   net/smc/af_smc.c
net/smc/af_smc.c:723:32: warning: cast to restricted __be64
net/smc/af_smc.c:1427:52: warning: cast to restricted __be64
  CHECK   net/smc/smc_pnet.c
  CHECK   net/smc/smc_ib.c
  CHECK   net/smc/smc_clc.c
net/smc/smc_clc.c:954:72: warning: incorrect type in argument 1 (different base types)
net/smc/smc_clc.c:954:72:    expected unsigned short [usertype] chid
net/smc/smc_clc.c:954:72:    got restricted __be16 [usertype] chid
net/smc/smc_clc.c:1050:29: warning: incorrect type in assignment (different base types)
net/smc/smc_clc.c:1050:29:    expected unsigned long long [usertype] gid
net/smc/smc_clc.c:1050:29:    got restricted __be64 [usertype]
net/smc/smc_clc.c:1051:31: warning: incorrect type in assignment (different base types)
net/smc/smc_clc.c:1051:31:    expected unsigned long long [usertype] token
net/smc/smc_clc.c:1051:31:    got restricted __be64 [usertype]


[1] linux/Documentation/process/coding-style.rst
[2] https://www.kernel.org/doc/html/v6.3/process/maintainer-netdev.html?highlight=network
[3] scripts/checkpatch.pl --strict --max-line-length=80 --git HEAD-18
[4] make C=2 CF=-D__CHECK_ENDIAN__ M=net/smc -Wunused-function -Wimplicit-fallthrough -Wincompatible-function-pointer-types-strict



When I installed the patches, I noticed that 
> smcd info
showed an SEID, even though I had no ISM device --> good


> smcd device
FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
0000 0                   0000   Yes       2

This needs some improvements.., but I'm not sure what is the best way to display virtual smcd interfaces in the smc-tools.


I was able to do SMC transfers via the smcd-loopback feature :-D
Alexandra Winter Sept. 28, 2023, 8:56 a.m. UTC | #3
On 27.09.23 17:16, Alexandra Winter wrote:
> Hello Wen Gu,
> 
> I applied and built your patches and noticed some things that you may want to consider in the next version:


FYI, patchwork basically complains about many the same issues:
https://patchwork.kernel.org/project/netdevbpf/list/?series=787037&state=*

In general you should run those check BEFORE you send the patches and not rely on patchwork.
Wen Gu Sept. 28, 2023, 4:42 p.m. UTC | #4
On 2023/9/27 23:16, Alexandra Winter wrote:
> 
> 
> On 24.09.23 17:16, Wen Gu wrote:
>> Wen Gu (18):
>>    net/smc: decouple ism_dev from SMC-D device dump
>>    net/smc: decouple ism_dev from SMC-D DMB registration
>>    net/smc: extract v2 check helper from SMC-D device registration
>>    net/smc: support SMCv2.x supplemental features negotiation
>>    net/smc: reserve CHID range for SMC-D virtual device
>>    net/smc: extend GID to 128bits only for virtual ISM device
>>    net/smc: disable SEID on non-s390 architecture
>>    net/smc: enable virtual ISM device feature bit
>>    net/smc: introduce SMC-D loopback device
>>    net/smc: implement ID-related operations of loopback
>>    net/smc: implement some unsupported operations of loopback
>>    net/smc: implement DMB-related operations of loopback
>>    net/smc: register loopback device as SMC-Dv2 device
>>    net/smc: add operation for getting DMB attribute
>>    net/smc: add operations for DMB attach and detach
>>    net/smc: avoid data copy from sndbuf to peer RMB in SMC-D
>>    net/smc: modify cursor update logic when sndbuf mapped to RMB
>>    net/smc: add interface implementation of loopback device
>>
>>   drivers/s390/net/ism_drv.c    |  20 +-
>>   include/net/smc.h             |  32 ++-
>>   include/uapi/linux/smc.h      |   3 +
>>   include/uapi/linux/smc_diag.h |   2 +
>>   net/smc/Kconfig               |  13 ++
>>   net/smc/Makefile              |   2 +-
>>   net/smc/af_smc.c              |  88 ++++++--
>>   net/smc/smc.h                 |   7 +
>>   net/smc/smc_cdc.c             |  56 ++++-
>>   net/smc/smc_cdc.h             |   1 +
>>   net/smc/smc_clc.c             |  64 ++++--
>>   net/smc/smc_clc.h             |  10 +-
>>   net/smc/smc_core.c            | 111 +++++++++-
>>   net/smc/smc_core.h            |   9 +-
>>   net/smc/smc_diag.c            |  11 +-
>>   net/smc/smc_ism.c             | 100 ++++++---
>>   net/smc/smc_ism.h             |  24 ++-
>>   net/smc/smc_loopback.c        | 489 ++++++++++++++++++++++++++++++++++++++++++
>>   net/smc/smc_loopback.h        |  54 +++++
>>   net/smc/smc_pnet.c            |   4 +-
>>   20 files changed, 996 insertions(+), 104 deletions(-)
>>   create mode 100644 net/smc/smc_loopback.c
>>   create mode 100644 net/smc/smc_loopback.h
> 
> 
> Hello Wen Gu,
> 
> I applied and built your patches and noticed some things that you may want to consider in the next version:
> 
> Series should be split up [2]
> 
> Several lines exceed 80 columns [1][3]
> 
> 'git clang-format HEAD~18' finds several formatting issues.
> 	Maybe not all of them need to be fixed.
> 
> codespell *.patch
> 0006-net-smc-extend-GID-to-128bits-only-for-virtual-ISM-d.patch:7: protocal ==> protocol
> 
> With your patches applied I get some new warnings [4]:
> Seems there are some ntoh conversions missing
> 
>    CHECK   net/smc/af_smc.c
> net/smc/af_smc.c:723:32: warning: cast to restricted __be64
> net/smc/af_smc.c:1427:52: warning: cast to restricted __be64
>    CHECK   net/smc/smc_pnet.c
>    CHECK   net/smc/smc_ib.c
>    CHECK   net/smc/smc_clc.c
> net/smc/smc_clc.c:954:72: warning: incorrect type in argument 1 (different base types)
> net/smc/smc_clc.c:954:72:    expected unsigned short [usertype] chid
> net/smc/smc_clc.c:954:72:    got restricted __be16 [usertype] chid
> net/smc/smc_clc.c:1050:29: warning: incorrect type in assignment (different base types)
> net/smc/smc_clc.c:1050:29:    expected unsigned long long [usertype] gid
> net/smc/smc_clc.c:1050:29:    got restricted __be64 [usertype]
> net/smc/smc_clc.c:1051:31: warning: incorrect type in assignment (different base types)
> net/smc/smc_clc.c:1051:31:    expected unsigned long long [usertype] token
> net/smc/smc_clc.c:1051:31:    got restricted __be64 [usertype]
> 
> 
> [1] linux/Documentation/process/coding-style.rst
> [2] https://www.kernel.org/doc/html/v6.3/process/maintainer-netdev.html?highlight=network
> [3] scripts/checkpatch.pl --strict --max-line-length=80 --git HEAD-18
> [4] make C=2 CF=-D__CHECK_ENDIAN__ M=net/smc -Wunused-function -Wimplicit-fallthrough -Wincompatible-function-pointer-types-strict
> 
> 

Hi Sandy,

Thank you very much for your detailed comments. They are really helpful.
I will check and fix them in my next version.

> 
> When I installed the patches, I noticed that
>> smcd info
> showed an SEID, even though I had no ISM device --> good
> 

Yes, virtual ISM device will return the right SEID (same as that returned
by ISM device) on s390 arch.

> 
>> smcd device
> FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
> 0000 0                   0000   Yes       2
> 
> This needs some improvements.., but I'm not sure what is the best way to display virtual smcd interfaces in the smc-tools.
> 

Right. My idea is to add the display of the device's name and GID
along with the PCI information (all zero) to indicate it's virtual
smcd device.

> 
> I was able to do SMC transfers via the smcd-loopback feature :-D
> 

Glad to hear this. I will improve the issues you mentioned in the next
version and next version will be split up.

Thank you very much.
Wen Gu
Wen Gu Sept. 28, 2023, 5:29 p.m. UTC | #5
On 2023/9/28 16:56, Alexandra Winter wrote:
> 
> 
> On 27.09.23 17:16, Alexandra Winter wrote:
>> Hello Wen Gu,
>>
>> I applied and built your patches and noticed some things that you may want to consider in the next version:
> 
> 
> FYI, patchwork basically complains about many the same issues:
> https://patchwork.kernel.org/project/netdevbpf/list/?series=787037&state=*
> 
> In general you should run those check BEFORE you send the patches and not rely on patchwork.
Thank you Sandy. I seem to have not seen the specific content of these checks. May I ask how to
run those patchwork check locally? So that I can make sure everything is ok before send them.
Alexandra Winter Sept. 29, 2023, 1:31 p.m. UTC | #6
On 28.09.23 19:29, Wen Gu wrote:
> 
> 
> On 2023/9/28 16:56, Alexandra Winter wrote:
>>
>>
>> On 27.09.23 17:16, Alexandra Winter wrote:
>>> Hello Wen Gu,
>>>
>>> I applied and built your patches and noticed some things that you may want to consider in the next version:
>>
>>
>> FYI, patchwork basically complains about many the same issues:
>> https://patchwork.kernel.org/project/netdevbpf/list/?series=787037&state=*
>>
>> In general you should run those check BEFORE you send the patches and not rely on patchwork.
> Thank you Sandy. I seem to have not seen the specific content of these checks. May I ask how to
> run those patchwork check locally? So that I can make sure everything is ok before send them.
> 

Citing from Documentation/process/maintainer-netdev.rst :

"patchwork checks
~~~~~~~~~~~~~~~~

Checks in patchwork are mostly simple wrappers around existing kernel
scripts, the sources are available at:

https://github.com/kuba-moo/nipa/tree/master/tests

**Do not** post your patches just to run them through the checks.
You must ensure that your patches are ready by testing them locally
before posting to the mailing list. The patchwork build bot instance
gets overloaded very easily and netdev@vger really doesn't need more
traffic if we can help it."

HTH
Wen Gu Oct. 4, 2023, 8:42 a.m. UTC | #7
On 2023/9/29 21:31, Alexandra Winter wrote:
> 
> 
> On 28.09.23 19:29, Wen Gu wrote:
>>
>>
>> On 2023/9/28 16:56, Alexandra Winter wrote:
>>>
>>>
>>> On 27.09.23 17:16, Alexandra Winter wrote:
>>>> Hello Wen Gu,
>>>>
>>>> I applied and built your patches and noticed some things that you may want to consider in the next version:
>>>
>>>
>>> FYI, patchwork basically complains about many the same issues:
>>> https://patchwork.kernel.org/project/netdevbpf/list/?series=787037&state=*
>>>
>>> In general you should run those check BEFORE you send the patches and not rely on patchwork.
>> Thank you Sandy. I seem to have not seen the specific content of these checks. May I ask how to
>> run those patchwork check locally? So that I can make sure everything is ok before send them.
>>
> 
> Citing from Documentation/process/maintainer-netdev.rst :
> 
> "patchwork checks
> ~~~~~~~~~~~~~~~~
> 
> Checks in patchwork are mostly simple wrappers around existing kernel
> scripts, the sources are available at:
> 
> https://github.com/kuba-moo/nipa/tree/master/tests
> 
> **Do not** post your patches just to run them through the checks.
> You must ensure that your patches are ready by testing them locally
> before posting to the mailing list. The patchwork build bot instance
> gets overloaded very easily and netdev@vger really doesn't need more
> traffic if we can help it."
> 
> HTH

Thank you! Sandy.
Niklas Schnelle Oct. 5, 2023, 8:21 a.m. UTC | #8
On Sun, 2023-09-24 at 23:16 +0800, Wen Gu wrote:
> Hi, all
> 
> # Background
> 
> SMC-D is now used in IBM z with ISM function to optimize network interconnect
> for intra-CPC communications. Inspired by this, we try to make SMC-D available
> on the non-s390 architecture through a software-simulated virtual ISM device,
> such as loopback-ism device here, to accelerate inter-process or inter-containers
> communication within the same OS.
> 
> # Design
> 
> This patch set includes 4 parts:
> 
>  - Patch #1-#3: decouple ISM device hard code from SMC-D stack.
>  - Patch #4-#8: implement virtual ISM extension defined in SMCv2.1.
>  - Patch #9-#13: implement loopback-ism device.
>  - Patch #14-#18: memory copy optimization for the case using loopback.
> 
> The loopback-ism device is designed as a kernel device and not be limited to
> a specific net namespace, ends of both inter-process connection (1/1' in diagram
> below) or inter-container connection (2/2' in diagram below) will find that peer
> shares the same loopback-ism device during the CLC handshake. Then loopback-ism
> device will be chosen.
> 
>  Container 1 (ns1)                              Container 2 (ns2)
>  +-----------------------------------------+    +-------------------------+
>  | +-------+      +-------+      +-------+ |    |        +-------+        |
>  | | App A |      | App B |      | App C | |    |        | App D |<-+     |
>  | +-------+      +---^---+      +-------+ |    |        +-------+  |(2') |
>  |     |127.0.0.1 (1')|             |192.168.0.11       192.168.0.12|     |
>  |  (1)|   +--------+ | +--------+  |(2)   |    | +--------+   +--------+ |
>  |     `-->|   lo   |-` |  eth0  |<-`      |    | |   lo   |   |  eth0  | |
>  +---------+--|---^-+---+-----|--+---------+    +-+--------+---+-^------+-+
>               |   |           |                                  |
>  Kernel       |   |           |                                  |
>  +----+-------v---+-----------v----------------------------------+---+----+
>  |    |                            TCP                               |    |
>  |    |                                                              |    |
>  |    +--------------------------------------------------------------+    |
>  |                                                                        |
>  |                           +--------------+                             |
>  |                           | smc loopback |                             |
>  +---------------------------+--------------+-----------------------------+
> 
> 
> loopback-ism device allocs RMBs and sndbufs for each connection peer and 'moves'
> data from sndbuf at one end to RMB at the other end. Since communication occurs
> within the same kernel, the sndbuf can be mapped to peer RMB so that the data
> copy in loopback-ism case can be avoided.
> 
>  Container 1 (ns1)                              Container 2 (ns2)
>  +-----------------------------------------+    +-------------------------+
>  | +-------+      +-------+      +-------+ |    |        +-------+        |
>  | | App A |      | App B |      | App C | |    |        | App D |        |
>  | +-------+      +--^----+      +-------+ |    |        +---^---+        |
>  |       |           |               |     |    |            |            |
>  |   (1) |      (1') |           (2) |     |    |       (2') |            |
>  |       |           |               |     |    |            |            |
>  +-------|-----------|---------------|-----+    +------------|------------+
>          |           |               |                       |
>  Kernel  |           |               |                       |
>  +-------|-----------|---------------|-----------------------|------------+
>  | +-----v-+      +-------+      +---v---+               +-------+        |
>  | | snd A |-+    | RMB B |<--+  | snd C |-+          +->| RMB D |        |
>  | +-------+ |    +-------+   |  +-------+ |          |  +-------+        |
>  | +-------+ |    +-------+   |  +-------+ |          |  +-------+        |
>  | | RMB A | |    | snd B |   |  | RMB C | |          |  | snd D |        |
>  | +-------+ |    +-------+   |  +-------+ |          |  +-------+        |
>  |           |               +-------------v+         |                   |
>  |           +-------------->| smc loopback |---------+                   |
>  +---------------------------+--------------+-----------------------------+
> 
> # Benchmark Test
> 
>  * Test environments:
>       - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>       - SMC sndbuf/RMB size 1MB.
> 
>  * Test object:
>       - TCP: run on TCP loopback.
>       - domain: run on UNIX domain.
>       - SMC lo: run on SMC loopback device.
> 
> 1. ipc-benchmark (see [1])
> 
>  - ./<foo> -c 1000000 -s 100
> 
>                             TCP                  SMC-lo
> Message
> rate (msg/s)              81539                  151251(+85.50%)
> 
> 2. sockperf
> 
>  - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>  - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
> 
>                             TCP                  SMC-lo
> Bandwidth(MBps)         5313.66                 8270.51(+55.65%)
> Latency(us)               5.806                   3.207(-44.76%)
> 
> 3. nginx/wrk
> 
>  - serv: <smc_run> nginx
>  - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
> 
>                            TCP                   SMC-lo
> Requests/s           194641.79                258656.13(+32.89%)
> 
> 4. redis-benchmark
> 
>  - serv: <smc_run> redis-server
>  - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
> 
>                            TCP                   SMC-lo
> GET(Requests/s)       85855.34                115640.35(+34.69%)
> SET(Requests/s)       86337.15                118203.30(+36.90%)
> 
> [1] https://github.com/goldsborough/ipc-bench
> 

Hi Wen Gu,

I've been trying out your series with iperf3, qperf, and uperf on
s390x. I'm using network namespaces with a ConnectX VF from the same
card in each namespace for the initial TCP/IP connection i.e. initially
it goes out to a real NIC even if that can switch internally. All of
these look great for streaming workloads both in terms of performance
and stability. With a Connect-Request-Response workload and uperf
however I've run into issues. The test configuration I use is as
follows:

Client Command:

# host=$ip_server ip netns exec client smc_run uperf -m tcp_crr.xml

Server Command:

# ip netns exec server smc_run uperf -s &> /dev/null

Uperf tcp_crr.xml:

<?xml version="1.0"?>
<profile name="TCP_CRR">
        <group nthreads="12">
                <transaction duration="120">
                        <flowop type="connect" options="remotehost=$host protocol=tcp" />
                        <flowop type="write" options="size=200"/>
                        <flowop type="read" options="size=1000"/>
                        <flowop type="disconnect" />
                </transaction>
        </group>
</profile>

The workload first runs fine but then after about 4 GB of data
transferred fails with "Connection refused" and "Connection reset by
peer" errors. The failure is not permanent however and re-running
the streaming workloads run fine again (with both uperf server and
client restarted). So I suspect something gets stuck in either the
client or server sockets. The same workload runs fine with TCP/IP of
course.

Thanks,
Niklas
Wen Gu Oct. 8, 2023, 7:19 a.m. UTC | #9
On 2023/10/5 16:21, Niklas Schnelle wrote:

> 
> Hi Wen Gu,
> 
> I've been trying out your series with iperf3, qperf, and uperf on
> s390x. I'm using network namespaces with a ConnectX VF from the same
> card in each namespace for the initial TCP/IP connection i.e. initially
> it goes out to a real NIC even if that can switch internally. All of
> these look great for streaming workloads both in terms of performance
> and stability. With a Connect-Request-Response workload and uperf
> however I've run into issues. The test configuration I use is as
> follows:
> 
> Client Command:
> 
> # host=$ip_server ip netns exec client smc_run uperf -m tcp_crr.xml
> 
> Server Command:
> 
> # ip netns exec server smc_run uperf -s &> /dev/null
> 
> Uperf tcp_crr.xml:
> 
> <?xml version="1.0"?>
> <profile name="TCP_CRR">
>          <group nthreads="12">
>                  <transaction duration="120">
>                          <flowop type="connect" options="remotehost=$host protocol=tcp" />
>                          <flowop type="write" options="size=200"/>
>                          <flowop type="read" options="size=1000"/>
>                          <flowop type="disconnect" />
>                  </transaction>
>          </group>
> </profile>
> 
> The workload first runs fine but then after about 4 GB of data
> transferred fails with "Connection refused" and "Connection reset by
> peer" errors. The failure is not permanent however and re-running
> the streaming workloads run fine again (with both uperf server and
> client restarted). So I suspect something gets stuck in either the
> client or server sockets. The same workload runs fine with TCP/IP of
> course.
> 
> Thanks,
> Niklas
> 
> 

Hi Niklas,

Thank you very much for the test. With the test example you provided, I've
reproduced the issue in my VM. And moreover, sometimes the test complains
with 'Error saying goodbye with <ip>'

I'll figure out what's going on here.

Thanks!
Wen Gu
Wen Gu Oct. 17, 2023, 3:49 a.m. UTC | #10
On 2023/10/8 15:19, Wen Gu wrote:
> 
> 
> On 2023/10/5 16:21, Niklas Schnelle wrote:
> 
>>
>> Hi Wen Gu,
>>
>> I've been trying out your series with iperf3, qperf, and uperf on
>> s390x. I'm using network namespaces with a ConnectX VF from the same
>> card in each namespace for the initial TCP/IP connection i.e. initially
>> it goes out to a real NIC even if that can switch internally. All of
>> these look great for streaming workloads both in terms of performance
>> and stability. With a Connect-Request-Response workload and uperf
>> however I've run into issues. The test configuration I use is as
>> follows:
>>
>> Client Command:
>>
>> # host=$ip_server ip netns exec client smc_run uperf -m tcp_crr.xml
>>
>> Server Command:
>>
>> # ip netns exec server smc_run uperf -s &> /dev/null
>>
>> Uperf tcp_crr.xml:
>>
>> <?xml version="1.0"?>
>> <profile name="TCP_CRR">
>>          <group nthreads="12">
>>                  <transaction duration="120">
>>                          <flowop type="connect" options="remotehost=$host protocol=tcp" />
>>                          <flowop type="write" options="size=200"/>
>>                          <flowop type="read" options="size=1000"/>
>>                          <flowop type="disconnect" />
>>                  </transaction>
>>          </group>
>> </profile>
>>
>> The workload first runs fine but then after about 4 GB of data
>> transferred fails with "Connection refused" and "Connection reset by
>> peer" errors. The failure is not permanent however and re-running
>> the streaming workloads run fine again (with both uperf server and
>> client restarted). So I suspect something gets stuck in either the
>> client or server sockets. The same workload runs fine with TCP/IP of
>> course.
>>
>> Thanks,
>> Niklas
>>
>>
> 
> Hi Niklas,
> 
> Thank you very much for the test. With the test example you provided, I've
> reproduced the issue in my VM. And moreover, sometimes the test complains
> with 'Error saying goodbye with <ip>'
> 
> I'll figure out what's going on here.
> 
> Thanks!
> Wen Gu

I think that there is a common issue for SMC-R and SMC-D. I also reproduce
'connection reset by peer' and 'Error saying goodbye with <ip>' when using
SMC-R under the same test condition. They occur at the end of the test.

When the uperf test time ends, some signals are sent. At this point there
are usually some SMC connections doing CLC handshake. I catch some -EINTR(-4)
in client and -ECONNRESET(-104) in server returned from smc_clc_wait_msg,
(correspondingly handshake error counts also increase) and TCP RST packets
sent to terminate the CLC TCP connection(clcsock).

I am not sure if this should be considered as a bydesign or a bug of SMC.
 From an application perspective, the conn reset behavior only happens when
using SMC.

@Wenjia, could you please take a look at this?

Thanks,
Wen Gu
Wenjia Zhang Oct. 18, 2023, 7:43 p.m. UTC | #11
On 17.10.23 05:49, Wen Gu wrote:
> 
> 
> On 2023/10/8 15:19, Wen Gu wrote:
>>
>>
>> On 2023/10/5 16:21, Niklas Schnelle wrote:
>>
>>>
>>> Hi Wen Gu,
>>>
>>> I've been trying out your series with iperf3, qperf, and uperf on
>>> s390x. I'm using network namespaces with a ConnectX VF from the same
>>> card in each namespace for the initial TCP/IP connection i.e. initially
>>> it goes out to a real NIC even if that can switch internally. All of
>>> these look great for streaming workloads both in terms of performance
>>> and stability. With a Connect-Request-Response workload and uperf
>>> however I've run into issues. The test configuration I use is as
>>> follows:
>>>
>>> Client Command:
>>>
>>> # host=$ip_server ip netns exec client smc_run uperf -m tcp_crr.xml
>>>
>>> Server Command:
>>>
>>> # ip netns exec server smc_run uperf -s &> /dev/null
>>>
>>> Uperf tcp_crr.xml:
>>>
>>> <?xml version="1.0"?>
>>> <profile name="TCP_CRR">
>>>          <group nthreads="12">
>>>                  <transaction duration="120">
>>>                          <flowop type="connect" 
>>> options="remotehost=$host protocol=tcp" />
>>>                          <flowop type="write" options="size=200"/>
>>>                          <flowop type="read" options="size=1000"/>
>>>                          <flowop type="disconnect" />
>>>                  </transaction>
>>>          </group>
>>> </profile>
>>>
>>> The workload first runs fine but then after about 4 GB of data
>>> transferred fails with "Connection refused" and "Connection reset by
>>> peer" errors. The failure is not permanent however and re-running
>>> the streaming workloads run fine again (with both uperf server and
>>> client restarted). So I suspect something gets stuck in either the
>>> client or server sockets. The same workload runs fine with TCP/IP of
>>> course.
>>>
>>> Thanks,
>>> Niklas
>>>
>>>
>>
>> Hi Niklas,
>>
>> Thank you very much for the test. With the test example you provided, 
>> I've
>> reproduced the issue in my VM. And moreover, sometimes the test complains
>> with 'Error saying goodbye with <ip>'
>>
>> I'll figure out what's going on here.
>>
>> Thanks!
>> Wen Gu
> 
> I think that there is a common issue for SMC-R and SMC-D. I also reproduce
> 'connection reset by peer' and 'Error saying goodbye with <ip>' when using
> SMC-R under the same test condition. They occur at the end of the test.
> 
> When the uperf test time ends, some signals are sent. At this point there
> are usually some SMC connections doing CLC handshake. I catch some 
> -EINTR(-4)
> in client and -ECONNRESET(-104) in server returned from smc_clc_wait_msg,
> (correspondingly handshake error counts also increase) and TCP RST packets
> sent to terminate the CLC TCP connection(clcsock).
> 
> I am not sure if this should be considered as a bydesign or a bug of SMC.
>  From an application perspective, the conn reset behavior only happens when
> using SMC.
> 
> @Wenjia, could you please take a look at this?
> 
> Thanks,
> Wen Gu

Hi Wen,

Do you mean the bug in smc_clc_wait_msg()?
If yes, I can not see any problem in the smc_clc_wait_msg(). From your 
description, it looks to me like the server should get the CLC_PROPOSAL 
message, but nothing in it while the client is waiting for the accept 
CLC_ACCEPT message from the server until the wait loops is broken out.

Thanks,
Wenjia