[0/6] refactor RDMA live migration based on rsocket API

Message-ID: <1717503252-51884-1-git-send-email-arei.gonglei@huawei.com>
From: Gonglei <arei.gonglei@huawei.com>
To: <qemu-devel@nongnu.org>
CC: <peterx@redhat.com>, <yu.zhang@ionos.com>, <mgalaxy@akamai.com>,
 <elmar.gerdes@ionos.com>, <zhengchuan@huawei.com>, <berrange@redhat.com>,
 <armbru@redhat.com>, <lizhijian@fujitsu.com>, <pbonzini@redhat.com>,
 <mst@redhat.com>, <xiexiangyou@huawei.com>, <linux-rdma@vger.kernel.org>,
 <lixiao91@huawei.com>, <arei.gonglei@huawei.com>, <jinpu.wang@ionos.com>,
 Jialin Wang <wangjialin23@huawei.com>
Subject: [PATCH 0/6] refactor RDMA live migration based on rsocket API
Date: Tue, 4 Jun 2024 20:14:06 +0800

Gonglei (Arei) June 4, 2024, 12:14 p.m. UTC

From: Jialin Wang <wangjialin23@huawei.com>

Hi,

This patch series attempts to refactor RDMA live migration by
introducing a new QIOChannelRDMA class based on the rsocket API.

The rsocket API, declared in /usr/include/rdma/rsocket.h, is a higher-level
interface that is a 1-1 match of the normal kernel 'sockets' API. It hides
the details of the RDMA protocol inside rsocket and lets us add support for
modern features such as multifd more easily.
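
To illustrate the 1-1 correspondence, here is a minimal sketch (not code
from this series). The x_* macro names and the USE_RSOCKET switch are
assumptions made up for the example; only the r*() functions themselves come
from <rdma/rsocket.h>. By default it builds against ordinary kernel sockets.
The single-threaded connect-then-accept order relies on the TCP listen
backlog, so an rsocket build would probably need the two ends in separate
threads or processes.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifdef USE_RSOCKET            /* assumption: built against librdmacm */
#include <rdma/rsocket.h>
#define x_socket      rsocket
#define x_bind        rbind
#define x_listen      rlisten
#define x_accept      raccept
#define x_connect     rconnect
#define x_getsockname rgetsockname
#define x_send        rsend
#define x_recv        rrecv
#define x_close       rclose
#else                         /* default: ordinary kernel sockets */
#define x_socket      socket
#define x_bind        bind
#define x_listen      listen
#define x_accept      accept
#define x_connect     connect
#define x_getsockname getsockname
#define x_send        send
#define x_recv        recv
#define x_close       close
#endif

/*
 * Send msg over a loopback stream connection and read it back on the
 * accepted side.  Returns 0 if the bytes arrive intact, -1 otherwise.
 */
static int loopback_roundtrip(const char *msg)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    size_t len = strlen(msg) + 1, got = 0;
    char buf[64];
    int lfd = -1, cfd = -1, sfd = -1, ret = -1;

    if (len > sizeof(buf)) {
        return -1;
    }
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                 /* let the stack pick a free port */

    lfd = x_socket(AF_INET, SOCK_STREAM, 0);
    cfd = x_socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0 || cfd < 0 ||
        x_bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        x_listen(lfd, 1) < 0 ||
        x_getsockname(lfd, (struct sockaddr *)&addr, &alen) < 0 ||
        x_connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        (sfd = x_accept(lfd, NULL, NULL)) < 0 ||
        x_send(cfd, msg, len, 0) != (ssize_t)len) {
        goto out;
    }
    while (got < len) {                /* a stream may deliver in pieces */
        ssize_t n = x_recv(sfd, buf + got, len - got, 0);
        if (n <= 0) {
            goto out;
        }
        got += (size_t)n;
    }
    ret = memcmp(buf, msg, len) == 0 ? 0 : -1;
out:
    if (sfd >= 0) x_close(sfd);
    if (cfd >= 0) x_close(cfd);
    if (lfd >= 0) x_close(lfd);
    return ret;
}
```

The body of loopback_roundtrip() does not change between the two builds; only
the macro table does, which is the property the paragraph above relies on.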

Here is the previous discussion on refactoring RDMA live migration using
the rsocket API:

https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/

We have encountered some bugs when using rsocket and plan to report them to
the rdma-core community.

In addition, rsocket makes the programming more convenient, but note that
this method introduces multiple memory copies, so some performance
degradation is to be expected. We hope that people with RDMA network cards
can help verify this. Thank you!

Jialin Wang (6):
  migration: remove RDMA live migration temporarily
  io: add QIOChannelRDMA class
  io/channel-rdma: support working in coroutine
  tests/unit: add test-io-channel-rdma.c
  migration: introduce new RDMA live migration
  migration/rdma: support multifd for RDMA migration

 docs/rdma.txt                     |  420 ---
 include/io/channel-rdma.h         |  165 ++
 io/channel-rdma.c                 |  798 ++++++
 io/meson.build                    |    1 +
 io/trace-events                   |   14 +
 meson.build                       |    6 -
 migration/meson.build             |    3 +-
 migration/migration-stats.c       |    5 +-
 migration/migration-stats.h       |    4 -
 migration/migration.c             |   13 +-
 migration/migration.h             |    9 -
 migration/multifd.c               |   10 +
 migration/options.c               |   16 -
 migration/options.h               |    2 -
 migration/qemu-file.c             |    1 -
 migration/ram.c                   |   90 +-
 migration/rdma.c                  | 4205 +----------------------------
 migration/rdma.h                  |   67 +-
 migration/savevm.c                |    2 +-
 migration/trace-events            |   68 +-
 qapi/migration.json               |   13 +-
 scripts/analyze-migration.py      |    3 -
 tests/unit/meson.build            |    1 +
 tests/unit/test-io-channel-rdma.c |  276 ++
 24 files changed, 1360 insertions(+), 4832 deletions(-)
 delete mode 100644 docs/rdma.txt
 create mode 100644 include/io/channel-rdma.h
 create mode 100644 io/channel-rdma.c
 create mode 100644 tests/unit/test-io-channel-rdma.c

Peter Xu June 4, 2024, 7:32 p.m. UTC | #1

Hi, Lei, Jialin,

Thanks a lot for working on this!

I think we'll need to wait a bit for feedback from Jinpu and his team on the
RDMA side, and from Daniel on iochannels.  Also, please remember to copy
Fabiano Rosas in any relevant future posts.  We'd also like to know whether
he has any comments.  I have him copied in this reply.

On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> From: Jialin Wang <wangjialin23@huawei.com>
> 
> Hi,
> 
> This patch series attempts to refactor RDMA live migration by
> introducing a new QIOChannelRDMA class based on the rsocket API.
> 
> The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> that is a 1-1 match of the normal kernel 'sockets' API, which hides the
> detail of rdma protocol into rsocket and allows us to add support for
> some modern features like multifd more easily.
> 
> Here is the previous discussion on refactoring RDMA live migration using
> the rsocket API:
> 
> https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/
> 
> We have encountered some bugs when using rsocket and plan to submit them to
> the rdma-core community.
> 
> In addition, the use of rsocket makes our programming more convenient,
> but it must be noted that this method introduces multiple memory copies,
> which can be imagined that there will be a certain performance degradation,
> hoping that friends with RDMA network cards can help verify, thank you!

It would be good to elaborate on whether you tested it in-house.  What
numbers should people expect, exactly?  Is that okay from Huawei's POV?

Besides that, the code looks pretty good to me at first glance.  Before
others chime in, here are some high-level comments.

Firstly, can we avoid using a coroutine for listen()?  I ask because I see
that rdma_accept_incoming_migration() runs in a loop calling raccept();
would that hang the QEMU main loop even with the coroutine, before all
channels are ready?  I'm not a coroutine person, but I think the hope is
that we can make the dest QEMU run in a thread in the future, just like the
src QEMU, so the fewer coroutines in this path the better.

I think I also left a comment elsewhere on whether it would be possible to
allow iochannels to implement their own poll() functions, to avoid the
per-channel poll thread that is proposed in this series.

https://lore.kernel.org/r/ZldY21xVExtiMddB@x1n
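
A purely hypothetical sketch of what a per-channel poll() hook could look
like follows. None of these names (Channel, poll_fn, fd_channel_poll,
channel_wait) exist in QEMU or in this series; they are assumptions for
illustration only. The fd-backed callback uses plain poll(); an
rsocket-backed channel could presumably supply a callback built on rpoll()
from <rdma/rsocket.h> instead.

```c
#include <poll.h>
#include <unistd.h>

typedef struct Channel Channel;

/* Hypothetical channel type: each backend installs its own poll hook. */
struct Channel {
    int fd;
    /* Returns the revents mask (> 0), 0 on timeout, -1 on error. */
    int (*poll_fn)(Channel *ch, short events, int timeout_ms);
};

/* Poll hook for a channel backed by an ordinary file descriptor. */
static int fd_channel_poll(Channel *ch, short events, int timeout_ms)
{
    struct pollfd pfd = { .fd = ch->fd, .events = events, .revents = 0 };
    int n = poll(&pfd, 1, timeout_ms);

    if (n <= 0) {
        return n;              /* 0: timed out, -1: error */
    }
    return pfd.revents;
}

/* Generic wait: callers never need to know which backend they have. */
static int channel_wait(Channel *ch, short events, int timeout_ms)
{
    return ch->poll_fn(ch, events, timeout_ms);
}
```

With a hook like this, generic code would call channel_wait() and would not
need a dedicated thread per channel just to translate readiness events.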

Personally I think that even with the thread proposal this is better than
the old rdma code, but I still want to double check with you guys.  E.g.,
maybe that just won't work at all?  Again, that would also depend on first
moving migration incoming into a thread to keep the dest QEMU main loop
intact, I think, but I hope we will get there irrespective of rdma; IOW it
would be nice for that to happen even earlier if possible.

Another nitpick is that qio_channel_rdma_listen_async() doesn't look used
and may be a candidate for removal.

Thanks,

Michael S. Tsirkin June 5, 2024, 7:57 a.m. UTC | #2

On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> From: Jialin Wang <wangjialin23@huawei.com>
> 
> Hi,
> 
> This patch series attempts to refactor RDMA live migration by
> introducing a new QIOChannelRDMA class based on the rsocket API.
> 
> The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> that is a 1-1 match of the normal kernel 'sockets' API, which hides the
> detail of rdma protocol into rsocket and allows us to add support for
> some modern features like multifd more easily.
> 
> Here is the previous discussion on refactoring RDMA live migration using
> the rsocket API:
> 
> https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/
> 
> We have encountered some bugs when using rsocket and plan to submit them to
> the rdma-core community.
> 
> In addition, the use of rsocket makes our programming more convenient,
> but it must be noted that this method introduces multiple memory copies,
> which can be imagined that there will be a certain performance degradation,
> hoping that friends with RDMA network cards can help verify, thank you!

So you didn't test it with an RDMA card?
You really should test with an RDMA card though, for correctness
as much as performance.



Gonglei (Arei) June 5, 2024, 10 a.m. UTC | #3

> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst@redhat.com]
> Sent: Wednesday, June 5, 2024 3:57 PM
> To: Gonglei (Arei) <arei.gonglei@huawei.com>
> Cc: qemu-devel@nongnu.org; peterx@redhat.com; yu.zhang@ionos.com;
> mgalaxy@akamai.com; elmar.gerdes@ionos.com; zhengchuan
> <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
> lizhijian@fujitsu.com; pbonzini@redhat.com; Xiexiangyou
> <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> <wangjialin23@huawei.com>
> Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> 
> On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > From: Jialin Wang <wangjialin23@huawei.com>
> >
> > Hi,
> >
> > This patch series attempts to refactor RDMA live migration by
> > introducing a new QIOChannelRDMA class based on the rsocket API.
> >
> > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > the detail of rdma protocol into rsocket and allows us to add support
> > for some modern features like multifd more easily.
> >
> > Here is the previous discussion on refactoring RDMA live migration
> > using the rsocket API:
> >
> > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linar
> > o.org/
> >
> > We have encountered some bugs when using rsocket and plan to submit
> > them to the rdma-core community.
> >
> > In addition, the use of rsocket makes our programming more convenient,
> > but it must be noted that this method introduces multiple memory
> > copies, which can be imagined that there will be a certain performance
> > degradation, hoping that friends with RDMA network cards can help verify,
> thank you!
> 
> So you didn't test it with an RDMA card?

Yep, we tested it with Soft-RoCE.

> You really should test with an RDMA card though, for correctness as much as
> performance.
> 
We will. We just don't have an environment with RDMA cards on hand at the moment.

Regards,
-Gonglei


Gonglei (Arei) June 5, 2024, 10:09 a.m. UTC | #4

Hi Peter,

> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Wednesday, June 5, 2024 3:32 AM
> To: Gonglei (Arei) <arei.gonglei@huawei.com>
> Cc: qemu-devel@nongnu.org; yu.zhang@ionos.com; mgalaxy@akamai.com;
> elmar.gerdes@ionos.com; zhengchuan <zhengchuan@huawei.com>;
> berrange@redhat.com; armbru@redhat.com; lizhijian@fujitsu.com;
> pbonzini@redhat.com; mst@redhat.com; Xiexiangyou
> <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> <wangjialin23@huawei.com>; Fabiano Rosas <farosas@suse.de>
> Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> 
> Hi, Lei, Jialin,
> 
> Thanks a lot for working on this!
> 
> I think we'll need to wait a bit on feedbacks from Jinpu and his team on RDMA
> side, also Daniel for iochannels.  Also, please remember to copy Fabiano
> Rosas in any relevant future posts.  We'd also like to know whether he has any
> comments too.  I have him copied in this reply.
> 
> On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > From: Jialin Wang <wangjialin23@huawei.com>
> >
> > Hi,
> >
> > This patch series attempts to refactor RDMA live migration by
> > introducing a new QIOChannelRDMA class based on the rsocket API.
> >
> > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > the detail of rdma protocol into rsocket and allows us to add support
> > for some modern features like multifd more easily.
> >
> > Here is the previous discussion on refactoring RDMA live migration
> > using the rsocket API:
> >
> > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linar
> > o.org/
> >
> > We have encountered some bugs when using rsocket and plan to submit
> > them to the rdma-core community.
> >
> > In addition, the use of rsocket makes our programming more convenient,
> > but it must be noted that this method introduces multiple memory
> > copies, which can be imagined that there will be a certain performance
> > degradation, hoping that friends with RDMA network cards can help verify,
> thank you!
> 
> It'll be good to elaborate if you tested it in-house. What people should expect
> on the numbers exactly?  Is that okay from Huawei's POV?
> 
> Besides that, the code looks pretty good at a first glance to me.  Before
> others chim in, here're some high level comments..
> 
> Firstly, can we avoid using coroutine when listen()?  Might be relevant when I
> see that rdma_accept_incoming_migration() runs in a loop to do raccept(), but
> would that also hang the qemu main loop even with the coroutine, before all
> channels are ready?  I'm not a coroutine person, but I think the hope is that
> we can make dest QEMU run in a thread in the future just like the src QEMU, so
> the less coroutine the better in this path.
> 

Because the rsocket is set to non-blocking, raccept() returns EAGAIN when no
connection is pending; the coroutine then yields, so it does not hang the
QEMU main loop.
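
A minimal sketch of that behaviour, using an ordinary TCP socket as a
stand-in: since rsocket mirrors the sockets API, the same pattern would use
rfcntl() and raccept(). The helper names and the ACCEPT_WOULD_BLOCK constant
are assumptions for this example, not code from the series.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

enum { ACCEPT_WOULD_BLOCK = -2 };   /* hypothetical "try again later" code */

/* Create a non-blocking listener on a free loopback port; -1 on error. */
static int make_nonblocking_listener(void)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int flags;

    if (fd < 0) {
        return -1;
    }
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;              /* let the stack pick a free port */

    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ||
        bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 1) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * One accept attempt.  Returns the new fd, ACCEPT_WOULD_BLOCK when no
 * connection is pending (the point where a coroutine would yield back to
 * the main loop instead of spinning), or -1 on any other error.
 */
static int try_accept(int lfd)
{
    int fd = accept(lfd, NULL, NULL);

    if (fd < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        return ACCEPT_WOULD_BLOCK;
    }
    return fd;
}
```

The caller is expected to retry try_accept() only after being told the
listener is readable, which is what keeps the main loop responsive.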

> I think I also left a comment elsewhere on whether it would be possible to allow
> iochannels implement their own poll() functions to avoid the per-channel poll
> thread that is proposed in this series.
> 
> https://lore.kernel.org/r/ZldY21xVExtiMddB@x1n
> 

We noticed that, but it is a big change. I'm not sure it is a better way.

> Personally I think even with the thread proposal it's better than the old rdma
> code, but I just still want to double check with you guys.  E.g., maybe that just
> won't work at all?  Again, that'll also be based on the fact that we move
> migration incoming into a thread first to keep the dest QEMU main loop intact,
> I think, but I hope we will reach that irrelevant of rdma, IOW it'll be nice to
> happen even earlier if possible.
> 
Yep. This is a fairly big change; I wonder what other people suggest.

> Another nitpick is that qio_channel_rdma_listen_async() doesn't look used and
> may prone to removal.
> 

Yes. When we wrote the test case we wanted to test qio_channel_rdma_connect_async(),
so I also added qio_channel_rdma_listen_async(). It is not used in the RDMA live migration code.

Regards,
-Gonglei

Michael S. Tsirkin June 5, 2024, 10:23 a.m. UTC | #5

On Wed, Jun 05, 2024 at 10:00:24AM +0000, Gonglei (Arei) wrote:
> 
> 
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > Sent: Wednesday, June 5, 2024 3:57 PM
> > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > Cc: qemu-devel@nongnu.org; peterx@redhat.com; yu.zhang@ionos.com;
> > mgalaxy@akamai.com; elmar.gerdes@ionos.com; zhengchuan
> > <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
> > lizhijian@fujitsu.com; pbonzini@redhat.com; Xiexiangyou
> > <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> > <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> > <wangjialin23@huawei.com>
> > Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> > 
> > On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > > From: Jialin Wang <wangjialin23@huawei.com>
> > >
> > > Hi,
> > >
> > > This patch series attempts to refactor RDMA live migration by
> > > introducing a new QIOChannelRDMA class based on the rsocket API.
> > >
> > > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > > the detail of rdma protocol into rsocket and allows us to add support
> > > for some modern features like multifd more easily.
> > >
> > > Here is the previous discussion on refactoring RDMA live migration
> > > using the rsocket API:
> > >
> > > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linar
> > > o.org/
> > >
> > > We have encountered some bugs when using rsocket and plan to submit
> > > them to the rdma-core community.
> > >
> > > In addition, the use of rsocket makes our programming more convenient,
> > > but it must be noted that this method introduces multiple memory
> > > copies, which can be imagined that there will be a certain performance
> > > degradation, hoping that friends with RDMA network cards can help verify,
> > thank you!
> > 
> > So you didn't test it with an RDMA card?
> 
> Yep, we tested it by Soft-ROCE.
> 
> > You really should test with an RDMA card though, for correctness as much as
> > performance.
> > 
> We will, we just don't have RDMA cards environment on hand at the moment.
> 
> Regards,
> -Gonglei

Until it's tested on real hardware it is probably best to tag this
series as RFC in the subject.


Peter Xu June 5, 2024, 2:18 p.m. UTC | #6

On Wed, Jun 05, 2024 at 10:09:43AM +0000, Gonglei (Arei) wrote:
> Hi Peter,
> 
> > -----Original Message-----
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Wednesday, June 5, 2024 3:32 AM
> > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > Cc: qemu-devel@nongnu.org; yu.zhang@ionos.com; mgalaxy@akamai.com;
> > elmar.gerdes@ionos.com; zhengchuan <zhengchuan@huawei.com>;
> > berrange@redhat.com; armbru@redhat.com; lizhijian@fujitsu.com;
> > pbonzini@redhat.com; mst@redhat.com; Xiexiangyou
> > <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> > <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> > <wangjialin23@huawei.com>; Fabiano Rosas <farosas@suse.de>
> > Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> > 
> > Hi, Lei, Jialin,
> > 
> > Thanks a lot for working on this!
> > 
> > I think we'll need to wait a bit on feedbacks from Jinpu and his team on RDMA
> > side, also Daniel for iochannels.  Also, please remember to copy Fabiano
> > Rosas in any relevant future posts.  We'd also like to know whether he has any
> > comments too.  I have him copied in this reply.
> > 
> > On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > > From: Jialin Wang <wangjialin23@huawei.com>
> > >
> > > Hi,
> > >
> > > This patch series attempts to refactor RDMA live migration by
> > > introducing a new QIOChannelRDMA class based on the rsocket API.
> > >
> > > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > > the detail of rdma protocol into rsocket and allows us to add support
> > > for some modern features like multifd more easily.
> > >
> > > Here is the previous discussion on refactoring RDMA live migration
> > > using the rsocket API:
> > >
> > > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linar
> > > o.org/
> > >
> > > We have encountered some bugs when using rsocket and plan to submit
> > > them to the rdma-core community.
> > >
> > > In addition, the use of rsocket makes our programming more convenient,
> > > but it must be noted that this method introduces multiple memory
> > > copies, which can be imagined that there will be a certain performance
> > > degradation, hoping that friends with RDMA network cards can help verify,
> > thank you!
> > 
> > It'll be good to elaborate if you tested it in-house. What people should expect
> > on the numbers exactly?  Is that okay from Huawei's POV?
> > 
> > Besides that, the code looks pretty good at a first glance to me.  Before
> > others chim in, here're some high level comments..
> > 
> > Firstly, can we avoid using coroutine when listen()?  Might be relevant when I
> > see that rdma_accept_incoming_migration() runs in a loop to do raccept(), but
> > would that also hang the qemu main loop even with the coroutine, before all
> > channels are ready?  I'm not a coroutine person, but I think the hope is that
> > we can make dest QEMU run in a thread in the future just like the src QEMU, so
> > the less coroutine the better in this path.
> > 
> 
> Because rsocket is set to non-blocking, raccept will return EAGAIN when no connection 
> is received, coroutine will yield, and will not hang the qemu main loop.

Ah, that's OK.  I also just noticed that it may not be a big deal either, as
long as we're before migration_incoming_process().

I'm wondering whether it can be done similarly to what we do with sockets
in qio_net_listener_set_client_func_full().  After all, rsocket wants to
mimic the socket API, so it would make sense for the rsocket code to match
the socket code, or even reuse it.

> 
> > I think I also left a comment elsewhere on whether it would be possible to allow
> > iochannels implement their own poll() functions to avoid the per-channel poll
> > thread that is proposed in this series.
> > 
> > https://lore.kernel.org/r/ZldY21xVExtiMddB@x1n
> > 
> 
> We noticed that, and it's a big operation. I'm not sure that's a better way.
> 
> > Personally I think even with the thread proposal it's better than the old rdma
> > code, but I just still want to double check with you guys.  E.g., maybe that just
> > won't work at all?  Again, that'll also be based on the fact that we move
> > migration incoming into a thread first to keep the dest QEMU main loop intact,
> > I think, but I hope we will reach that irrelevant of rdma, IOW it'll be nice to
> > happen even earlier if possible.
> > 
> Yep. This is a fairly big change, I wonder what other people's suggestions are?

Yes, we can wait for others' opinions.  And btw I'm not asking for it, and I
don't think it'll be a blocker for this approach to land; as I said, this is
better than the current code, so it's definitely an improvement to me.

I'm purely curious, because if you're not going to do it for rdma, maybe
someday I'll try to do that, and I want to know what the "big change" could
be as I didn't dig further.  It would help me if you shared what issues
you've found.

Thanks,

Leon Romanovsky June 6, 2024, 11:31 a.m. UTC | #7

On Wed, Jun 05, 2024 at 10:00:24AM +0000, Gonglei (Arei) wrote:
> 
> 
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > Sent: Wednesday, June 5, 2024 3:57 PM
> > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > Cc: qemu-devel@nongnu.org; peterx@redhat.com; yu.zhang@ionos.com;
> > mgalaxy@akamai.com; elmar.gerdes@ionos.com; zhengchuan
> > <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
> > lizhijian@fujitsu.com; pbonzini@redhat.com; Xiexiangyou
> > <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> > <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> > <wangjialin23@huawei.com>
> > Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> > 
> > On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > > From: Jialin Wang <wangjialin23@huawei.com>
> > >
> > > Hi,
> > >
> > > This patch series attempts to refactor RDMA live migration by
> > > introducing a new QIOChannelRDMA class based on the rsocket API.
> > >
> > > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > > the detail of rdma protocol into rsocket and allows us to add support
> > > for some modern features like multifd more easily.
> > >
> > > Here is the previous discussion on refactoring RDMA live migration
> > > using the rsocket API:
> > >
> > > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linar
> > > o.org/
> > >
> > > We have encountered some bugs when using rsocket and plan to submit
> > > them to the rdma-core community.
> > >
> > > In addition, the use of rsocket makes our programming more convenient,
> > > but it must be noted that this method introduces multiple memory
> > > copies, which can be imagined that there will be a certain performance
> > > degradation, hoping that friends with RDMA network cards can help verify,
> > thank you!
> > 
> > So you didn't test it with an RDMA card?
> 
> Yep, we tested it by Soft-ROCE.

Does Soft-RoCE (RXE) support live migration?

Thanks

Zhijian Li (Fujitsu) June 7, 2024, 1:04 a.m. UTC | #8

On 06/06/2024 19:31, Leon Romanovsky wrote:
> On Wed, Jun 05, 2024 at 10:00:24AM +0000, Gonglei (Arei) wrote:
>>
>>
>>> -----Original Message-----
>>> From: Michael S. Tsirkin [mailto:mst@redhat.com]
>>> Sent: Wednesday, June 5, 2024 3:57 PM
>>> To: Gonglei (Arei) <arei.gonglei@huawei.com>
>>> Cc: qemu-devel@nongnu.org; peterx@redhat.com; yu.zhang@ionos.com;
>>> mgalaxy@akamai.com; elmar.gerdes@ionos.com; zhengchuan
>>> <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
>>> lizhijian@fujitsu.com; pbonzini@redhat.com; Xiexiangyou
>>> <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
>>> <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
>>> <wangjialin23@huawei.com>
>>> Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
>>>
>>> On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
>>>> From: Jialin Wang <wangjialin23@huawei.com>
>>>>
>>>> Hi,
>>>>
>>>> This patch series attempts to refactor RDMA live migration by
>>>> introducing a new QIOChannelRDMA class based on the rsocket API.
>>>>
>>>> The /usr/include/rdma/rsocket.h provides a higher level rsocket API
>>>> that is a 1-1 match of the normal kernel 'sockets' API, which hides
>>>> the detail of rdma protocol into rsocket and allows us to add support
>>>> for some modern features like multifd more easily.
>>>>
>>>> Here is the previous discussion on refactoring RDMA live migration
>>>> using the rsocket API:
>>>>
>>>> https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linar
>>>> o.org/
>>>>
>>>> We have encountered some bugs when using rsocket and plan to submit
>>>> them to the rdma-core community.
>>>>
>>>> In addition, the use of rsocket makes our programming more convenient,
>>>> but it must be noted that this method introduces multiple memory
>>>> copies, which can be imagined that there will be a certain performance
>>>> degradation, hoping that friends with RDMA network cards can help verify,
>>> thank you!
>>>
>>> So you didn't test it with an RDMA card?
>>
>> Yep, we tested it by Soft-ROCE.
> 
> Does Soft-RoCE (RXE) support live migration?


Yes, it does


Thanks
Zhijian

> 
> Thanks
>

Jinpu Wang June 7, 2024, 5:53 a.m. UTC | #9

Hi Gonglei, hi folks on the list,

On Tue, Jun 4, 2024 at 2:14 PM Gonglei <arei.gonglei@huawei.com> wrote:
>
> From: Jialin Wang <wangjialin23@huawei.com>
>
> Hi,
>
> This patch series attempts to refactor RDMA live migration by
> introducing a new QIOChannelRDMA class based on the rsocket API.
>
> The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> that is a 1-1 match of the normal kernel 'sockets' API, which hides the
> detail of rdma protocol into rsocket and allows us to add support for
> some modern features like multifd more easily.
>
> Here is the previous discussion on refactoring RDMA live migration using
> the rsocket API:
>
> https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/
>
> We have encountered some bugs when using rsocket and plan to submit them to
> the rdma-core community.
>
> In addition, the use of rsocket makes our programming more convenient,
> but it must be noted that this method introduces multiple memory copies,
> which can be imagined that there will be a certain performance degradation,
> hoping that friends with RDMA network cards can help verify, thank you!
First, thanks for the effort. We are running migration tests on our IB
fabric with different generations of Mellanox HCAs. The migration mostly
works, but there are a few failures; Yu will share the results separately.

The one blocker for the change is that the old implementation and the new
rsocket implementation cannot talk to each other, because they use
different wire protocols during connection establishment.
E.g. the old RDMA migration uses its own special control messages during
the migration flow, while rsocket uses different control messages, so
there is no way to migrate a VM over the rdma transport from a QEMU that
predates the rsocket patchset to a new version with the rsocket
implementation.

Probably we should keep both implementations for a while, mark the old
implementation as deprecated, promote the new implementation, and
highlight in the docs that they are not compatible.

Regards!
Jinpu



>
> Jialin Wang (6):
>   migration: remove RDMA live migration temporarily
>   io: add QIOChannelRDMA class
>   io/channel-rdma: support working in coroutine
>   tests/unit: add test-io-channel-rdma.c
>   migration: introduce new RDMA live migration
>   migration/rdma: support multifd for RDMA migration
>
>  docs/rdma.txt                     |  420 ---
>  include/io/channel-rdma.h         |  165 ++
>  io/channel-rdma.c                 |  798 ++++++
>  io/meson.build                    |    1 +
>  io/trace-events                   |   14 +
>  meson.build                       |    6 -
>  migration/meson.build             |    3 +-
>  migration/migration-stats.c       |    5 +-
>  migration/migration-stats.h       |    4 -
>  migration/migration.c             |   13 +-
>  migration/migration.h             |    9 -
>  migration/multifd.c               |   10 +
>  migration/options.c               |   16 -
>  migration/options.h               |    2 -
>  migration/qemu-file.c             |    1 -
>  migration/ram.c                   |   90 +-
>  migration/rdma.c                  | 4205 +----------------------------
>  migration/rdma.h                  |   67 +-
>  migration/savevm.c                |    2 +-
>  migration/trace-events            |   68 +-
>  qapi/migration.json               |   13 +-
>  scripts/analyze-migration.py      |    3 -
>  tests/unit/meson.build            |    1 +
>  tests/unit/test-io-channel-rdma.c |  276 ++
>  24 files changed, 1360 insertions(+), 4832 deletions(-)
>  delete mode 100644 docs/rdma.txt
>  create mode 100644 include/io/channel-rdma.h
>  create mode 100644 io/channel-rdma.c
>  create mode 100644 tests/unit/test-io-channel-rdma.c
>
> --
> 2.43.0
>

Gonglei (Arei) June 7, 2024, 8:28 a.m. UTC | #10

> -----Original Message-----
> From: Jinpu Wang [mailto:jinpu.wang@ionos.com]
> Sent: Friday, June 7, 2024 1:54 PM
> To: Gonglei (Arei) <arei.gonglei@huawei.com>
> Cc: qemu-devel@nongnu.org; peterx@redhat.com; yu.zhang@ionos.com;
> mgalaxy@akamai.com; elmar.gerdes@ionos.com; zhengchuan
> <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
> lizhijian@fujitsu.com; pbonzini@redhat.com; mst@redhat.com; Xiexiangyou
> <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> <lixiao91@huawei.com>; Wangjialin <wangjialin23@huawei.com>
> Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> 
> Hi Gonglei, hi folks on the list,
> 
> On Tue, Jun 4, 2024 at 2:14 PM Gonglei <arei.gonglei@huawei.com> wrote:
> >
> > From: Jialin Wang <wangjialin23@huawei.com>
> >
> > Hi,
> >
> > This patch series attempts to refactor RDMA live migration by
> > introducing a new QIOChannelRDMA class based on the rsocket API.
> >
> > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > the detail of rdma protocol into rsocket and allows us to add support
> > for some modern features like multifd more easily.
> >
> > Here is the previous discussion on refactoring RDMA live migration
> > using the rsocket API:
> >
> > > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/
> >
> > We have encountered some bugs when using rsocket and plan to submit
> > them to the rdma-core community.
> >
> > In addition, the use of rsocket makes our programming more convenient,
> > but it must be noted that this method introduces multiple memory
> > copies, which can be imagined that there will be a certain performance
> > degradation, hoping that friends with RDMA network cards can help verify,
> thank you!
> First thx for the effort, we are running migration tests on our IB fabric, different
> generation of HCA from mellanox, the migration works ok, there are a few
> failures,  Yu will share the result later separately.
> 

Thank you so much. 

> The one blocker for the change is the old implementation and the new rsocket
> implementation; they don't talk to each other due to the effect of different wire
> protocol during connection establishment.
> eg the old RDMA migration has special control message during the migration
> flow, which rsocket use a different control message, so there lead to no way to
> migrate VM using rdma transport pre to the rsocket patchset to a new version
> with rsocket implementation.
> 
> Probably we should keep both implementation for a while, mark the old
> implementation as deprecated, and promote the new implementation, and
> high light in doc, they are not compatible.
> 

IMO it makes sense. What's your opinion, @Peter?


Regards,
-Gonglei

> Regards!
> Jinpu
> 
> 
> 
> >
> > Jialin Wang (6):
> >   migration: remove RDMA live migration temporarily
> >   io: add QIOChannelRDMA class
> >   io/channel-rdma: support working in coroutine
> >   tests/unit: add test-io-channel-rdma.c
> >   migration: introduce new RDMA live migration
> >   migration/rdma: support multifd for RDMA migration
> >
> >  docs/rdma.txt                     |  420 ---
> >  include/io/channel-rdma.h         |  165 ++
> >  io/channel-rdma.c                 |  798 ++++++
> >  io/meson.build                    |    1 +
> >  io/trace-events                   |   14 +
> >  meson.build                       |    6 -
> >  migration/meson.build             |    3 +-
> >  migration/migration-stats.c       |    5 +-
> >  migration/migration-stats.h       |    4 -
> >  migration/migration.c             |   13 +-
> >  migration/migration.h             |    9 -
> >  migration/multifd.c               |   10 +
> >  migration/options.c               |   16 -
> >  migration/options.h               |    2 -
> >  migration/qemu-file.c             |    1 -
> >  migration/ram.c                   |   90 +-
> >  migration/rdma.c                  | 4205 +----------------------------
> >  migration/rdma.h                  |   67 +-
> >  migration/savevm.c                |    2 +-
> >  migration/trace-events            |   68 +-
> >  qapi/migration.json               |   13 +-
> >  scripts/analyze-migration.py      |    3 -
> >  tests/unit/meson.build            |    1 +
> >  tests/unit/test-io-channel-rdma.c |  276 ++
> > >  24 files changed, 1360 insertions(+), 4832 deletions(-)
> > >  delete mode 100644 docs/rdma.txt
> > >  create mode 100644 include/io/channel-rdma.h
> > >  create mode 100644 io/channel-rdma.c
> > >  create mode 100644 tests/unit/test-io-channel-rdma.c
> >
> > --
> > 2.43.0
> >

Gonglei (Arei) June 7, 2024, 8:49 a.m. UTC | #11

> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Wednesday, June 5, 2024 10:19 PM
> To: Gonglei (Arei) <arei.gonglei@huawei.com>
> Cc: qemu-devel@nongnu.org; yu.zhang@ionos.com; mgalaxy@akamai.com;
> elmar.gerdes@ionos.com; zhengchuan <zhengchuan@huawei.com>;
> berrange@redhat.com; armbru@redhat.com; lizhijian@fujitsu.com;
> pbonzini@redhat.com; mst@redhat.com; Xiexiangyou
> <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> <wangjialin23@huawei.com>; Fabiano Rosas <farosas@suse.de>
> Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> 
> On Wed, Jun 05, 2024 at 10:09:43AM +0000, Gonglei (Arei) wrote:
> > Hi Peter,
> >
> > > -----Original Message-----
> > > From: Peter Xu [mailto:peterx@redhat.com]
> > > Sent: Wednesday, June 5, 2024 3:32 AM
> > > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > > Cc: qemu-devel@nongnu.org; yu.zhang@ionos.com;
> mgalaxy@akamai.com;
> > > elmar.gerdes@ionos.com; zhengchuan <zhengchuan@huawei.com>;
> > > berrange@redhat.com; armbru@redhat.com; lizhijian@fujitsu.com;
> > > pbonzini@redhat.com; mst@redhat.com; Xiexiangyou
> > > <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> > > <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> > > <wangjialin23@huawei.com>; Fabiano Rosas <farosas@suse.de>
> > > Subject: Re: [PATCH 0/6] refactor RDMA live migration based on
> > > rsocket API
> > >
> > > Hi, Lei, Jialin,
> > >
> > > Thanks a lot for working on this!
> > >
> > > I think we'll need to wait a bit on feedbacks from Jinpu and his
> > > team on RDMA side, also Daniel for iochannels.  Also, please
> > > remember to copy Fabiano Rosas in any relevant future posts.  We'd
> > > also like to know whether he has any comments too.  I have him copied in
> this reply.
> > >
> > > On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > > > From: Jialin Wang <wangjialin23@huawei.com>
> > > >
> > > > Hi,
> > > >
> > > > This patch series attempts to refactor RDMA live migration by
> > > > introducing a new QIOChannelRDMA class based on the rsocket API.
> > > >
> > > > The /usr/include/rdma/rsocket.h provides a higher level rsocket
> > > > API that is a 1-1 match of the normal kernel 'sockets' API, which
> > > > hides the detail of rdma protocol into rsocket and allows us to
> > > > add support for some modern features like multifd more easily.
> > > >
> > > > Here is the previous discussion on refactoring RDMA live migration
> > > > using the rsocket API:
> > > >
> > > > > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/
> > > >
> > > > We have encountered some bugs when using rsocket and plan to
> > > > submit them to the rdma-core community.
> > > >
> > > > In addition, the use of rsocket makes our programming more
> > > > convenient, but it must be noted that this method introduces
> > > > multiple memory copies, which can be imagined that there will be a
> > > > certain performance degradation, hoping that friends with RDMA
> > > > network cards can help verify,
> > > thank you!
> > >
> > > It'll be good to elaborate if you tested it in-house. What people
> > > should expect on the numbers exactly?  Is that okay from Huawei's POV?
> > >
> > > Besides that, the code looks pretty good at a first glance to me.
> > > Before others chim in, here're some high level comments..
> > >
> > > Firstly, can we avoid using coroutine when listen()?  Might be
> > > relevant when I see that rdma_accept_incoming_migration() runs in a
> > > loop to do raccept(), but would that also hang the qemu main loop
> > > even with the coroutine, before all channels are ready?  I'm not a
> > > coroutine person, but I think the hope is that we can make dest QEMU
> > > run in a thread in the future just like the src QEMU, so the less coroutine
> the better in this path.
> > >
> >
> > Because rsocket is set to non-blocking, raccept will return EAGAIN
> > when no connection is received, coroutine will yield, and will not hang the
> qemu main loop.
> 
> Ah that's ok.  And also I just noticed it may not be a big deal either as long as
> we're before migration_incoming_process().
> 
> I'm wondering whether it can do it similarly like what we do with sockets in
> qio_net_listener_set_client_func_full().  After all, rsocket wants to mimic the
> socket API.  It'll make sense if rsocket code tries to match with socket, or
> even reuse.
> 

Actually we tried this solution, but it didn't work. Please see patch 3/6:

Known limitations:
  For a blocking rsocket fd, if we use io_create_watch to wait for
  POLLIN or POLLOUT events, we cannot determine when the fd is not
  ready to read/write, as we can with non-blocking fds. Therefore,
  once an event occurs, it keeps occurring, potentially leaving QEMU
  hanging. So we need to be cautious to avoid hanging when using
  io_create_watch.
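
To illustrate the difference the limitation above relies on, here is a
minimal sketch using Python's standard socket module as a stand-in for
rsocket (an assumption made purely for illustration; it does not call
rsocket or QEMU code, and read_if_ready is a hypothetical helper). A
non-blocking fd reports "not ready" as an error the caller can act on,
for example by yielding a coroutine, whereas the same read on a blocking
fd would simply not return:

```python
import select
import socket


def read_if_ready(sock, nbytes=4096):
    """Return available bytes, or None when the fd is not ready (EAGAIN)."""
    try:
        return sock.recv(nbytes)
    except BlockingIOError:
        # Non-blocking fd: "not ready" is reported instead of blocking,
        # so the caller can yield and retry later.
        return None


# A connected pair of stream sockets; only `a` is made non-blocking.
a, b = socket.socketpair()
a.setblocking(False)

nothing = read_if_ready(a)           # no data has been sent yet

b.sendall(b"ping")
select.select([a], [], [], 5.0)      # wait until `a` is readable
data = read_if_ready(a)

a.close()
b.close()
```

With a blocking fd there is no equivalent of the BlockingIOError branch,
which is why a watch that fires when the fd is not actually ready can
leave the caller stuck.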


Regards,
-Gonglei

> >
> > > I think I also left a comment elsewhere on whether it would be
> > > possible to allow iochannels implement their own poll() functions to
> > > avoid the per-channel poll thread that is proposed in this series.
> > >
> > > https://lore.kernel.org/r/ZldY21xVExtiMddB@x1n
> > >
> >
> > We noticed that, and it's a big operation. I'm not sure that's a better way.
> >
> > > Personally I think even with the thread proposal it's better than
> > > the old rdma code, but I just still want to double check with you
> > > guys.  E.g., maybe that just won't work at all?  Again, that'll also
> > > be based on the fact that we move migration incoming into a thread
> > > first to keep the dest QEMU main loop intact, I think, but I hope we
> > > will reach that irrelevant of rdma, IOW it'll be nice to happen even earlier if
> possible.
> > >
> > Yep. This is a fairly big change, I wonder what other people's suggestions
> are?
> 
> Yes we can wait for others' opinions.  And btw I'm not asking for it and I don't
> think it'll be a blocker for this approach to land, as I said this is better than the
> current code so it's definitely an improvement to me.
> 
> I'm purely curious, because if you're not going to do it for rdma, maybe
> someday I'll try to do that, and I want to know what "big change" could be as I
> didn't dig further.  It may help me by sharing what issues you've found.
> 
> Thanks,
> 
> --
> Peter Xu

Daniel P. Berrangé June 7, 2024, 10:06 a.m. UTC | #12

On Tue, Jun 04, 2024 at 03:32:19PM -0400, Peter Xu wrote:
> Hi, Lei, Jialin,
> 
> Thanks a lot for working on this!
> 
> I think we'll need to wait a bit on feedbacks from Jinpu and his team on
> RDMA side, also Daniel for iochannels.  Also, please remember to copy
> Fabiano Rosas in any relevant future posts.  We'd also like to know whether
> he has any comments too.  I have him copied in this reply.

I've not formally reviewed it, but I had a quick glance through the
I/O channel patches and they all look sensible. Pretty much  exactly
what I was hoping it would end up looking like.

> > In addition, the use of rsocket makes our programming more convenient,
> > but it must be noted that this method introduces multiple memory copies,
> > which can be imagined that there will be a certain performance degradation,
> > hoping that friends with RDMA network cards can help verify, thank you!
> 
> It'll be good to elaborate if you tested it in-house. What people should
> expect on the numbers exactly?  Is that okay from Huawei's POV?
> 
> Besides that, the code looks pretty good at a first glance to me.  Before

snip

> Personally I think even with the thread proposal it's better than the old
> rdma code, but I just still want to double check with you guys.  E.g.,
> maybe that just won't work at all?  Again, that'll also be based on the
> fact that we move migration incoming into a thread first to keep the dest
> QEMU main loop intact, I think, but I hope we will reach that irrelevant of
> rdma, IOW it'll be nice to happen even earlier if possible.

Yes, from the migration code POV, this is a massive step forward - the
RDMA integration is now completely trivial for migration code.

The $million question is what the performance of this new implementation
looks like on real hardware. As mentioned above the extra memory copies
will probably hurt performance compared to the old version. We need the
performance of the new RDMA impl to still be better than the plain TCP
sockets backend to make it worthwhile having RDMA.

With regards,
Daniel

Yu Zhang June 7, 2024, 4:24 p.m. UTC | #13

Hello Gonglei,

Jinpu and I have tested your patchset by running our migration test
cases on physical RDMA cards. The result: of 59 migration test cases,
10 failed. They succeed with the original RDMA migration code, but
always fail with the patchset. The syslog on the source server shows
the error below:

Jun  6 13:35:20 ps402a-43 WARN: Migration failed
uuid="44449999-3333-48dc-9082-1b6950e74ee1"
target=2a02:247f:401:2:2:0:a:2c error=Failed(Unable to write to
rsocket: Connection reset by peer)

We also tried to compare the migration speed with and without the
patchset. Without the patchset, a big VM (16 cores, 64 GB memory)
stressed with a heavy memory workload can be migrated successfully.
With the patchset, only a small idle VM (1-2 cores, 2-4 GB memory) can
be migrated successfully. In each failed migration, the above error is
logged on the source server.

Therefore, I assume that this version is not yet capable of handling
heavy load. I'm also looking into the code to see if anything can be
improved. We really appreciate your excellent work!

Best regards,
Yu Zhang @ IONOS cloud

On Wed, Jun 5, 2024 at 12:00 PM Gonglei (Arei) <arei.gonglei@huawei.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > Sent: Wednesday, June 5, 2024 3:57 PM
> > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > Cc: qemu-devel@nongnu.org; peterx@redhat.com; yu.zhang@ionos.com;
> > mgalaxy@akamai.com; elmar.gerdes@ionos.com; zhengchuan
> > <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
> > lizhijian@fujitsu.com; pbonzini@redhat.com; Xiexiangyou
> > <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> > <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> > <wangjialin23@huawei.com>
> > Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> >
> > On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > > From: Jialin Wang <wangjialin23@huawei.com>
> > >
> > > Hi,
> > >
> > > This patch series attempts to refactor RDMA live migration by
> > > introducing a new QIOChannelRDMA class based on the rsocket API.
> > >
> > > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > > the detail of rdma protocol into rsocket and allows us to add support
> > > for some modern features like multifd more easily.
> > >
> > > Here is the previous discussion on refactoring RDMA live migration
> > > using the rsocket API:
> > >
> > > > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/
> > >
> > > We have encountered some bugs when using rsocket and plan to submit
> > > them to the rdma-core community.
> > >
> > > In addition, the use of rsocket makes our programming more convenient,
> > > but it must be noted that this method introduces multiple memory
> > > copies, which can be imagined that there will be a certain performance
> > > degradation, hoping that friends with RDMA network cards can help verify,
> > thank you!
> >
> > So you didn't test it with an RDMA card?
>
> Yep, we tested it by Soft-ROCE.
>
> > You really should test with an RDMA card though, for correctness as much as
> > performance.
> >
> We will, we just don't have RDMA cards environment on hand at the moment.
>
> Regards,
> -Gonglei
>
> >
> > > Jialin Wang (6):
> > >   migration: remove RDMA live migration temporarily
> > >   io: add QIOChannelRDMA class
> > >   io/channel-rdma: support working in coroutine
> > >   tests/unit: add test-io-channel-rdma.c
> > >   migration: introduce new RDMA live migration
> > >   migration/rdma: support multifd for RDMA migration
> > >
> > >  docs/rdma.txt                     |  420 ---
> > >  include/io/channel-rdma.h         |  165 ++
> > >  io/channel-rdma.c                 |  798 ++++++
> > >  io/meson.build                    |    1 +
> > >  io/trace-events                   |   14 +
> > >  meson.build                       |    6 -
> > >  migration/meson.build             |    3 +-
> > >  migration/migration-stats.c       |    5 +-
> > >  migration/migration-stats.h       |    4 -
> > >  migration/migration.c             |   13 +-
> > >  migration/migration.h             |    9 -
> > >  migration/multifd.c               |   10 +
> > >  migration/options.c               |   16 -
> > >  migration/options.h               |    2 -
> > >  migration/qemu-file.c             |    1 -
> > >  migration/ram.c                   |   90 +-
> > >  migration/rdma.c                  | 4205 +----------------------------
> > >  migration/rdma.h                  |   67 +-
> > >  migration/savevm.c                |    2 +-
> > >  migration/trace-events            |   68 +-
> > >  qapi/migration.json               |   13 +-
> > >  scripts/analyze-migration.py      |    3 -
> > >  tests/unit/meson.build            |    1 +
> > >  tests/unit/test-io-channel-rdma.c |  276 ++
> > > >  24 files changed, 1360 insertions(+), 4832 deletions(-)
> > > >  delete mode 100644 docs/rdma.txt
> > > >  create mode 100644 include/io/channel-rdma.h
> > > >  create mode 100644 io/channel-rdma.c
> > > >  create mode 100644 tests/unit/test-io-channel-rdma.c
> > >
> > > --
> > > 2.43.0
>

Peter Xu June 10, 2024, 4:31 p.m. UTC | #14

On Fri, Jun 07, 2024 at 08:28:29AM +0000, Gonglei (Arei) wrote:
> 
> 
> > -----Original Message-----
> > From: Jinpu Wang [mailto:jinpu.wang@ionos.com]
> > Sent: Friday, June 7, 2024 1:54 PM
> > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > Cc: qemu-devel@nongnu.org; peterx@redhat.com; yu.zhang@ionos.com;
> > mgalaxy@akamai.com; elmar.gerdes@ionos.com; zhengchuan
> > <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
> > lizhijian@fujitsu.com; pbonzini@redhat.com; mst@redhat.com; Xiexiangyou
> > <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> > <lixiao91@huawei.com>; Wangjialin <wangjialin23@huawei.com>
> > Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> > 
> > Hi Gonglei, hi folks on the list,
> > 
> > On Tue, Jun 4, 2024 at 2:14 PM Gonglei <arei.gonglei@huawei.com> wrote:
> > >
> > > From: Jialin Wang <wangjialin23@huawei.com>
> > >
> > > Hi,
> > >
> > > This patch series attempts to refactor RDMA live migration by
> > > introducing a new QIOChannelRDMA class based on the rsocket API.
> > >
> > > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > > the detail of rdma protocol into rsocket and allows us to add support
> > > for some modern features like multifd more easily.
> > >
> > > Here is the previous discussion on refactoring RDMA live migration
> > > using the rsocket API:
> > >
> > > > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/
> > >
> > > We have encountered some bugs when using rsocket and plan to submit
> > > them to the rdma-core community.
> > >
> > > In addition, the use of rsocket makes our programming more convenient,
> > > but it must be noted that this method introduces multiple memory
> > > copies, which can be imagined that there will be a certain performance
> > > degradation, hoping that friends with RDMA network cards can help verify,
> > thank you!
> > First thx for the effort, we are running migration tests on our IB fabric, different
> > generation of HCA from mellanox, the migration works ok, there are a few
> > failures,  Yu will share the result later separately.
> > 
> 
> Thank you so much. 
> 
> > The one blocker for the change is the old implementation and the new rsocket
> > implementation; they don't talk to each other due to the effect of different wire
> > protocol during connection establishment.
> > eg the old RDMA migration has special control message during the migration
> > flow, which rsocket use a different control message, so there lead to no way to
> > migrate VM using rdma transport pre to the rsocket patchset to a new version
> > with rsocket implementation.
> > 
> > Probably we should keep both implementation for a while, mark the old
> > implementation as deprecated, and promote the new implementation, and
> > high light in doc, they are not compatible.
> > 
> 
> IMO It makes sense. What's your opinion? @Peter.

Sounds good to me.  We can use an internal property field and enable
rsocket rdma migration on new machine types with rdma protocol, deprecating
both old rdma and that internal field after 2 releases.  So that when
receiving rdma migrations it'll use old property (as old qemu will use old
machine types), but when initiating rdma migration on new binary it'll
switch to rsocket.
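
The selection rule described above could be sketched roughly as follows.
This is only an illustration in Python, not QEMU code: the names
MachineType, RSOCKET_SINCE and rdma_transport are hypothetical, and the
version at which the default would flip (9.1 here) is a made-up
placeholder.

```python
from dataclasses import dataclass

# Hypothetical: first machine-type version whose internal property
# defaults to the rsocket-based RDMA transport.
RSOCKET_SINCE = (9, 1)


@dataclass(frozen=True)
class MachineType:
    name: str
    version: tuple  # (major, minor)


def rdma_transport(machine):
    """Pick the RDMA wire protocol from the machine type's compat default."""
    # Old machine types keep the legacy protocol so that an old QEMU,
    # which only knows old machine types, can still migrate to a new binary.
    if machine.version >= RSOCKET_SINCE:
        return "rsocket"
    return "legacy"


old_machine = MachineType("pc-q35-9.0", (9, 0))
new_machine = MachineType("pc-q35-9.1", (9, 1))
old_choice = rdma_transport(old_machine)
new_choice = rdma_transport(new_machine)
```

Because source and destination run the same machine type during a
migration, both sides would derive the same transport from it without
any extra negotiation on the wire.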

It might be more important to address either the failures or perf concerns
that others raised, though.

Thanks,

Peter Xu June 10, 2024, 4:35 p.m. UTC | #15

On Fri, Jun 07, 2024 at 08:49:01AM +0000, Gonglei (Arei) wrote:
> Actually we tried this solution, but it didn't work. Pls see patch 3/6
> 
> Known limitations: 
>   For a blocking rsocket fd, if we use io_create_watch to wait for
>   POLLIN or POLLOUT events, since the rsocket fd is blocking, we
>   cannot determine when it is not ready to read/write as we can with
>   non-blocking fds. Therefore, when an event occurs, it will occurs
>   always, potentially leave the qemu hanging. So we need be cautious
>   to avoid hanging when using io_create_watch .

I'm not sure I fully get that part, though.  In:

https://lore.kernel.org/all/ZldY21xVExtiMddB@x1n/

I was thinking that the iochannel would implement its own poll with the
_POLL flag, so in that case it'll call qio_channel_poll(), which should
call rpoll() directly.  So I didn't expect qio_channel_create_watch() to
be used.  I thought the context was that gmainloop won't work with
rsocket fds in general, but maybe I missed something.
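
Roughly, the dispatch I have in mind looks like the sketch below. It is
written in Python with plain sockets purely as an illustration: the
Channel classes, the "POLL" feature name and wait_readable() are
hypothetical, and select() stands in for both rpoll() and the generic
main-loop watch.

```python
import select
import socket


class Channel:
    """Generic channel: readiness is handled by the caller's main loop."""
    features = frozenset()

    def __init__(self, sock):
        self.sock = sock


class SelfPollingChannel(Channel):
    """Channel that advertises a (hypothetical) POLL feature."""
    features = frozenset({"POLL"})

    def poll(self, timeout):
        # Stand-in for a transport-specific poll such as rpoll().
        readable, _, _ = select.select([self.sock], [], [], timeout)
        return bool(readable)


def wait_readable(channel, timeout):
    """Use the channel's own poll when advertised, else the generic path."""
    if "POLL" in channel.features:
        return "channel-poll", channel.poll(timeout)
    readable, _, _ = select.select([channel.sock], [], [], timeout)
    return "generic-watch", bool(readable)


a, b = socket.socketpair()
b.sendall(b"x")
via_poll = wait_readable(SelfPollingChannel(a), 1.0)
via_watch = wait_readable(Channel(a), 1.0)
a.close()
b.close()
```

The point is only that the caller never creates a watch for a channel
that can poll itself, so no per-channel poll thread would be needed.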

Thanks,

Peter Xu Aug. 27, 2024, 8:15 p.m. UTC | #16

On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> From: Jialin Wang <wangjialin23@huawei.com>
> 
> Hi,
> 
> This patch series attempts to refactor RDMA live migration by
> introducing a new QIOChannelRDMA class based on the rsocket API.
> 
> The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> that is a 1-1 match of the normal kernel 'sockets' API, which hides the
> detail of rdma protocol into rsocket and allows us to add support for
> some modern features like multifd more easily.
> 
> Here is the previous discussion on refactoring RDMA live migration using
> the rsocket API:
> 
> https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/
> 
> We have encountered some bugs when using rsocket and plan to submit them to
> the rdma-core community.
> 
> In addition, the use of rsocket makes our programming more convenient,
> but it must be noted that this method introduces multiple memory copies,
> which can be imagined that there will be a certain performance degradation,
> hoping that friends with RDMA network cards can help verify, thank you!
> 
> Jialin Wang (6):
>   migration: remove RDMA live migration temporarily
>   io: add QIOChannelRDMA class
>   io/channel-rdma: support working in coroutine
>   tests/unit: add test-io-channel-rdma.c
>   migration: introduce new RDMA live migration
>   migration/rdma: support multifd for RDMA migration

This series has been idle for a while; we still need to know how to move
forward.  I guess I lost the latest status quo..

Any update (from anyone..) on what stage are we in?

Thanks,

Michael S. Tsirkin Aug. 27, 2024, 8:57 p.m. UTC | #17

On Tue, Aug 27, 2024 at 04:15:42PM -0400, Peter Xu wrote:
> On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > From: Jialin Wang <wangjialin23@huawei.com>
> > 
> > Hi,
> > 
> > This patch series attempts to refactor RDMA live migration by
> > introducing a new QIOChannelRDMA class based on the rsocket API.
> > 
> > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > that is a 1-1 match of the normal kernel 'sockets' API, which hides the
> > detail of rdma protocol into rsocket and allows us to add support for
> > some modern features like multifd more easily.
> > 
> > Here is the previous discussion on refactoring RDMA live migration using
> > the rsocket API:
> > 
> > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/
> > 
> > We have encountered some bugs when using rsocket and plan to submit them to
> > the rdma-core community.
> > 
> > In addition, the use of rsocket makes our programming more convenient,
> > but it must be noted that this method introduces multiple memory copies,
> > which can be imagined that there will be a certain performance degradation,
> > hoping that friends with RDMA network cards can help verify, thank you!
> > 
> > Jialin Wang (6):
> >   migration: remove RDMA live migration temporarily
> >   io: add QIOChannelRDMA class
> >   io/channel-rdma: support working in coroutine
> >   tests/unit: add test-io-channel-rdma.c
> >   migration: introduce new RDMA live migration
> >   migration/rdma: support multifd for RDMA migration
> 
> This series has been idle for a while; we still need to know how to move
> forward.


What exactly is the question? This got a bunch of comments,
the first thing to do would be to address them.


>  I guess I lost the latest status quo..
> 
> Any update (from anyone..) on what stage are we in?
> 
> Thanks,
> -- 
> Peter Xu

Michael Galaxy Sept. 22, 2024, 7:29 p.m. UTC | #18

Hi All,

I have met with the team from IONOS about their testing on actual IB
hardware here at KVM Forum today, and the requirements are starting to
make more sense to me. I didn't say much in our previous thread because
I misunderstood the requirements, so let me try to explain and see if
we're all on the same page. There appears to be a fundamental limitation
here with rsocket, which I don't see how to overcome.

The basic problem is that rsocket is trying to present a stream
abstraction, a concept that is fundamentally incompatible with RDMA. The
whole point of using RDMA in the first place is to avoid using the CPU,
and to do that, all of the memory (potentially hundreds of gigabytes)
needs to be registered with the hardware *in advance* (this is how the
original implementation works).

The need to fake a socket/bytestream abstraction eventually breaks down: 
there is a limit (a few GB) in rsocket, which the IONOS team previously 
reported in testing (see that email). It appears that rsocket can only 
map a limited amount of memory with the hardware before its internal 
"buffer" runs out; it must then unmap and remap the next batch of memory 
to continue along with the fake bytestream. This is very much sticking a 
square peg in a round hole. If you were to "relax" the rsocket 
implementation to register the entire VM memory space (as my original 
implementation does), then there wouldn't be any need for rsocket in the 
first place.
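
To make the difference concrete, here is a toy accounting model (a sketch 
only, not rsocket's or QEMU's actual code). It assumes the stream path 
copies every byte through a fixed-size bounce buffer exactly once, while 
the pre-registered path lets the NIC read guest memory directly. The 
function names, the 4 GiB buffer and the 256 GiB VM are illustrative 
assumptions.

```python
GIB = 1 << 30


def bounce_buffer_cost(total_bytes: int, buffer_bytes: int) -> tuple[int, int]:
    """Return (bytes copied by the CPU, number of buffer fills).

    Toy model: each byte of guest memory is copied once into a bounce
    buffer of at most `buffer_bytes` before the NIC can send it.
    """
    copied = 0
    fills = 0
    remaining = total_bytes
    while remaining > 0:
        chunk = min(remaining, buffer_bytes)
        copied += chunk      # CPU memcpy into the bounce buffer
        fills += 1           # buffer must drain before it can be reused
        remaining -= chunk
    return copied, fills


def preregistered_cost(total_bytes: int) -> tuple[int, int]:
    """Toy model: memory is registered up front, so no CPU copies occur."""
    return 0, 0


if __name__ == "__main__":
    vm_bytes = 256 * GIB   # hypothetical VM size
    buf_bytes = 4 * GIB    # hypothetical bounce-buffer ceiling
    copied, fills = bounce_buffer_cost(vm_bytes, buf_bytes)
    print(copied // GIB, fills)          # 256 GiB copied over 64 fills
    print(preregistered_cost(vm_bytes))  # (0, 0)
```

In this model the CPU work on the stream path grows linearly with VM 
size, which is the cost the pre-registration design is meant to avoid.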

I think there is just some misunderstanding here in the group about the 
way InfiniBand is intended to work. Does that make sense so far? I do 
understand the need for testing, but rsocket is simply not intended for 
the kind of massive bulk data transfer that we're proposing to use it 
for here, simply to make our lives better in testing.

Regarding testing: During our previous thread earlier this summer, why 
did we not consider making a better integration test to solve the test 
burden problem? To explain better: suppose a new integration test were 
written for QEMU, submitted, and reviewed (a reasonably complex test, in 
line with a traditional live migration integration test that actually 
spins up QEMU), using softRoCE in a localhost configuration that has 
full libibverbs support and still allows for compatibility testing with 
QEMU. Would such an integration test not be sufficient to handle the 
testing burden?
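
As a rough illustration of what such a test harness might prepare, the 
sketch below only builds the command lines; it does not run anything. It 
assumes the iproute2 `rdma link add ... type rxe` syntax for creating a 
softRoCE device and QEMU's `rdma:host:port` migration URI. The helper 
names, the device name `rxe0`, the netdev, host and port are all 
hypothetical placeholders, and whether a given kernel accepts a 
particular netdev for rxe is not something this sketch checks.

```python
def softroce_setup_commands(rxe_name: str, netdev: str) -> list[list[str]]:
    """Commands a harness could run (as root) to create a softRoCE device."""
    return [
        ["modprobe", "rdma_rxe"],
        ["rdma", "link", "add", rxe_name, "type", "rxe", "netdev", netdev],
    ]


def rdma_migration_uri(host: str, port: int) -> str:
    """Build the migration URI used on both source and destination."""
    return f"rdma:{host}:{port}"


def destination_args(host: str, port: int) -> list[str]:
    """Extra arguments for the destination QEMU process."""
    return ["-incoming", rdma_migration_uri(host, port)]


def source_monitor_command(host: str, port: int) -> str:
    """HMP command issued on the source QEMU monitor."""
    return f"migrate -d {rdma_migration_uri(host, port)}"


if __name__ == "__main__":
    for cmd in softroce_setup_commands("rxe0", "eth0"):
        print(" ".join(cmd))
    print(" ".join(destination_args("192.168.0.2", 4444)))
    print(source_monitor_command("192.168.0.2", 4444))
```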

Comments welcome,
- Michael

On 8/27/24 15:57, Michael S. Tsirkin wrote:
> On Tue, Aug 27, 2024 at 04:15:42PM -0400, Peter Xu wrote:
>> On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
>>> From: Jialin Wang <wangjialin23@huawei.com>
>>>
>>> Hi,
>>>
>>> This patch series attempts to refactor RDMA live migration by
>>> introducing a new QIOChannelRDMA class based on the rsocket API.
>>>
>>> The /usr/include/rdma/rsocket.h provides a higher level rsocket API
>>> that is a 1-1 match of the normal kernel 'sockets' API, which hides the
>>> detail of rdma protocol into rsocket and allows us to add support for
>>> some modern features like multifd more easily.
>>>
>>> Here is the previous discussion on refactoring RDMA live migration using
>>> the rsocket API:
>>>
>>> https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/
>>>
>>> We have encountered some bugs when using rsocket and plan to submit them to
>>> the rdma-core community.
>>>
>>> In addition, the use of rsocket makes our programming more convenient,
>>> but it must be noted that this method introduces multiple memory copies,
>>> which can be imagined that there will be a certain performance degradation,
>>> hoping that friends with RDMA network cards can help verify, thank you!
>>>
>>> Jialin Wang (6):
>>>    migration: remove RDMA live migration temporarily
>>>    io: add QIOChannelRDMA class
>>>    io/channel-rdma: support working in coroutine
>>>    tests/unit: add test-io-channel-rdma.c
>>>    migration: introduce new RDMA live migration
>>>    migration/rdma: support multifd for RDMA migration
>> This series has been idle for a while; we still need to know how to move
>> forward.
>
> What exactly is the question? This got a bunch of comments,
> the first thing to do would be to address them.
>
>
>>   I guess I lost the latest status quo..
>>
>> Any update (from anyone..) on what stage are we in?
>>
>> Thanks,
>> -- 
>> Peter Xu

Gonglei (Arei) Sept. 23, 2024, 1:04 a.m. UTC | #19

Hi,

> -----Original Message-----
> From: Michael Galaxy [mailto:mgalaxy@akamai.com]
> Sent: Monday, September 23, 2024 3:29 AM
> To: Michael S. Tsirkin <mst@redhat.com>; Peter Xu <peterx@redhat.com>
> Cc: Gonglei (Arei) <arei.gonglei@huawei.com>; qemu-devel@nongnu.org;
> yu.zhang@ionos.com; elmar.gerdes@ionos.com; zhengchuan
> <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
> lizhijian@fujitsu.com; pbonzini@redhat.com; Xiexiangyou
> <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> <wangjialin23@huawei.com>
> Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> 
> Hi All,
> 
> I have met with the team from IONOS about their testing on actual IB
> hardware here at KVM Forum today and the requirements are starting to make
> more sense to me. I didn't say much in our previous thread because I
> misunderstood the requirements, so let me try to explain and see if we're all on
> the same page. There appears to be a fundamental limitation here with rsocket,
> for which I don't see how it is possible to overcome.
> 
> The basic problem is that rsocket is trying to present a stream abstraction, a
> concept that is fundamentally incompatible with RDMA. The whole point of
> using RDMA in the first place is to avoid using the CPU, and to do that, all of the
> memory (potentially hundreds of gigabytes) need to be registered with the
> hardware *in advance* (this is how the original implementation works).
> 
> The need to fake a socket/bytestream abstraction eventually breaks down =>
> There is a limit (a few GB) in rsocket (which the IONOS team previous reported
> in testing.... see that email), it appears that means that rsocket is only going to
> be able to map a certain limited amount of memory with the hardware until its
> internal "buffer" runs out before it can then unmap and remap the next batch
> of memory with the hardware to continue along with the fake bytestream. This
> is very much sticking a square peg in a round hole. If you were to "relax" the
> rsocket implementation to register the entire VM memory space (as my
> original implementation does), then there wouldn't be any need for rsocket in
> the first place.
> 

Thank you for your opinion. You're right. rsocket runs into difficulties 
when transferring large amounts of data, and we haven't figured out why 
yet, although we did solve several problems with rsocket along the way.

In our deployment, VM live migration must complete quickly, with a 
downtime of 50 ms or less, so RDMA is an essential requirement for us. 
Going forward, I think we'll build on QEMU's native RDMA live migration 
solution. During this work we came to doubt whether RDMA live migration 
is really feasible through the rsocket refactoring, so the refactoring 
plan was shelved.


Regards,
-Gonglei

> I think there is just some misunderstanding here in the group in the way
> infiniband is intended to work. Does that make sense so far? I do understand
> the need for testing, but rsocket is simply not intended to be used for kind of
> massive bulk data transfer purposes that we're proposing using it here for,
> simply for the purposes of making our lives better in testing.
> 
> Regarding testing: During our previous thread earlier this summer, why did we
> not consider making a better integration test to solve the test burden problem?
> To explain better: If a new integration test were written for QEMU and
> submitted and reviewed (a reasonably complex test that was in line with a
> traditional live migration integration test that actually spins up QEMU) which
> used softRoCE in a localhost configuration that has full libibverbs supports and
> still allowed for compatibility testing with QEMU, would such an integration not
> be sufficient to handle the testing burden?
> 
> Comments welcome,
> - Michael
> 
> On 8/27/24 15:57, Michael S. Tsirkin wrote:
> > On Tue, Aug 27, 2024 at 04:15:42PM -0400, Peter Xu wrote:
> >> On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> >>> From: Jialin Wang <wangjialin23@huawei.com>
> >>>
> >>> Hi,
> >>>
> >>> This patch series attempts to refactor RDMA live migration by
> >>> introducing a new QIOChannelRDMA class based on the rsocket API.
> >>>
> >>> The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> >>> that is a 1-1 match of the normal kernel 'sockets' API, which hides
> >>> the detail of rdma protocol into rsocket and allows us to add
> >>> support for some modern features like multifd more easily.
> >>>
> >>> Here is the previous discussion on refactoring RDMA live migration
> >>> using the rsocket API:
> >>>
> > >>> https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/
> >>>
> >>> We have encountered some bugs when using rsocket and plan to submit
> >>> them to the rdma-core community.
> >>>
> >>> In addition, the use of rsocket makes our programming more
> >>> convenient, but it must be noted that this method introduces
> >>> multiple memory copies, which can be imagined that there will be a
> > >>> certain performance degradation, hoping that friends with RDMA network
> > >>> cards can help verify, thank you!
> >>>
> >>> Jialin Wang (6):
> >>>    migration: remove RDMA live migration temporarily
> >>>    io: add QIOChannelRDMA class
> >>>    io/channel-rdma: support working in coroutine
> >>>    tests/unit: add test-io-channel-rdma.c
> >>>    migration: introduce new RDMA live migration
> >>>    migration/rdma: support multifd for RDMA migration
> >> This series has been idle for a while; we still need to know how to
> >> move forward.
> >
> > What exactly is the question? This got a bunch of comments, the first
> > thing to do would be to address them.
> >
> >
> >>   I guess I lost the latest status quo..
> >>
> >> Any update (from anyone..) on what stage are we in?
> >>
> >> Thanks,
> >> --
> >> Peter Xu

Peter Xu Sept. 25, 2024, 3:08 p.m. UTC | #20

On Mon, Sep 23, 2024 at 01:04:17AM +0000, Gonglei (Arei) wrote:
> Hi,
> 
> > -----Original Message-----
> > From: Michael Galaxy [mailto:mgalaxy@akamai.com]
> > Sent: Monday, September 23, 2024 3:29 AM
> > To: Michael S. Tsirkin <mst@redhat.com>; Peter Xu <peterx@redhat.com>
> > Cc: Gonglei (Arei) <arei.gonglei@huawei.com>; qemu-devel@nongnu.org;
> > yu.zhang@ionos.com; elmar.gerdes@ionos.com; zhengchuan
> > <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
> > lizhijian@fujitsu.com; pbonzini@redhat.com; Xiexiangyou
> > <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> > <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> > <wangjialin23@huawei.com>
> > Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> > 
> > Hi All,
> > 
> > I have met with the team from IONOS about their testing on actual IB
> > hardware here at KVM Forum today and the requirements are starting to make
> > more sense to me. I didn't say much in our previous thread because I
> > misunderstood the requirements, so let me try to explain and see if we're all on
> > the same page. There appears to be a fundamental limitation here with rsocket,
> > for which I don't see how it is possible to overcome.
> > 
> > The basic problem is that rsocket is trying to present a stream abstraction, a
> > concept that is fundamentally incompatible with RDMA. The whole point of
> > using RDMA in the first place is to avoid using the CPU, and to do that, all of the
> > memory (potentially hundreds of gigabytes) need to be registered with the
> > hardware *in advance* (this is how the original implementation works).
> > 
> > The need to fake a socket/bytestream abstraction eventually breaks down =>
> > There is a limit (a few GB) in rsocket (which the IONOS team previous reported
> > in testing.... see that email), it appears that means that rsocket is only going to
> > be able to map a certain limited amount of memory with the hardware until its
> > internal "buffer" runs out before it can then unmap and remap the next batch
> > of memory with the hardware to continue along with the fake bytestream. This
> > is very much sticking a square peg in a round hole. If you were to "relax" the
> > rsocket implementation to register the entire VM memory space (as my
> > original implementation does), then there wouldn't be any need for rsocket in
> > the first place.

Yes, some test like this can be helpful.

And thanks for the summary.  That's definitely helpful.

One question from my side (as someone who knows nothing about
RDMA/rsocket): is that "a few GBs" limitation a software guard?  Would it
be possible for rsocket to provide an option that lets users opt in to
setting that value, so that it might work for the VM use case?  Would that
consume similar resources to the current QEMU impl while allowing it to
use rsockets with no perf regressions?

> 
> Thank you for your opinion. You're right. RSocket has encountered difficulties in 
> transferring large amounts of data. We haven't even figured it out yet. Although
> in this practice, we solved several problems with rsocket.
> 
> In our practice, we need to quickly complete VM live migration and the downtime 
> of live migration must be within 50 ms or less. Therefore, we use RDMA, which is 
> an essential requirement. Next, I think we'll do it based on Qemu's native RDMA 
> live migration solution. During this period, we really doubted whether RDMA live 
> migration was really feasible through rsocket refactoring, so the refactoring plan 
> was shelved.

To me, 50ms guaranteed is hard.  I'm personally not sure how much RDMA
helps if that's only about the transport.

I mean, at least I feel like someone would need to work out some general
limitations, like:

https://wiki.qemu.org/ToDo/LiveMigration#Optimize_memory_updates_for_non-iterative_vmstates
https://lore.kernel.org/all/20230317081904.24389-1-xuchuangxclwt@bytedance.com/

I also remember we always have outliers where save()/load() of device
state is simply slower (taking 100ms or more on a single device; I think
it normally has kernel/kvm involved).  That one device can already break
the rule, even if it happens rarely.

We also haven't looked into multiple other issues during downtime:

  - vm start/stop will invoke notifiers, and notifiers can (in some cases)
    take quite some time to finish

  - some features may enlarge downtime in an unpredictable way, but so far
    we don't yet have full control of it, e.g. pause-before-switchover for
    block layers

There can be other things floating around; these are just some examples.
None of the cases I mentioned above are related to the transport itself.
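
The point can be put as simple arithmetic. The sketch below is only a toy
budget check: the 50 ms target and the 100 ms single-device outlier come
from this thread, while every other number and all names are hypothetical.

```python
BUDGET_MS = 50.0  # downtime target mentioned in this thread


def total_downtime_ms(transport_ms: float,
                      device_state_ms: list[float],
                      notifier_ms: float = 0.0,
                      switchover_ms: float = 0.0) -> float:
    """Sum the downtime contributions while the VM is stopped."""
    return transport_ms + sum(device_state_ms) + notifier_ms + switchover_ms


def within_budget(total_ms: float, budget_ms: float = BUDGET_MS) -> bool:
    return total_ms <= budget_ms


if __name__ == "__main__":
    # Hypothetical: a perfect transport (0 ms) with one slow device.
    devices = [2.0, 3.0, 100.0]
    total = total_downtime_ms(0.0, devices)
    print(total, within_budget(total))  # 105.0 False
```

Even with the transport cost set to zero, a single slow device is enough
to exceed the budget in this model.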

Thanks,

Michael Galaxy Sept. 27, 2024, 8:34 p.m. UTC | #21

Hi Gonglei,

On 9/22/24 20:04, Gonglei (Arei) wrote:
> Hi,
>
>> -----Original Message-----
>> From: Michael Galaxy [mailto:mgalaxy@akamai.com]
>> Sent: Monday, September 23, 2024 3:29 AM
>> To: Michael S. Tsirkin <mst@redhat.com>; Peter Xu <peterx@redhat.com>
>> Cc: Gonglei (Arei) <arei.gonglei@huawei.com>; qemu-devel@nongnu.org;
>> yu.zhang@ionos.com; elmar.gerdes@ionos.com; zhengchuan
>> <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
>> lizhijian@fujitsu.com; pbonzini@redhat.com; Xiexiangyou
>> <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
>> <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
>> <wangjialin23@huawei.com>
>> Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
>>
>> Hi All,
>>
>> I have met with the team from IONOS about their testing on actual IB
>> hardware here at KVM Forum today and the requirements are starting to make
>> more sense to me. I didn't say much in our previous thread because I
>> misunderstood the requirements, so let me try to explain and see if we're all on
>> the same page. There appears to be a fundamental limitation here with rsocket,
>> for which I don't see how it is possible to overcome.
>>
>> The basic problem is that rsocket is trying to present a stream abstraction, a
>> concept that is fundamentally incompatible with RDMA. The whole point of
>> using RDMA in the first place is to avoid using the CPU, and to do that, all of the
>> memory (potentially hundreds of gigabytes) need to be registered with the
>> hardware *in advance* (this is how the original implementation works).
>>
>> The need to fake a socket/bytestream abstraction eventually breaks down =>
>> There is a limit (a few GB) in rsocket (which the IONOS team previous reported
>> in testing.... see that email), it appears that means that rsocket is only going to
>> be able to map a certain limited amount of memory with the hardware until its
>> internal "buffer" runs out before it can then unmap and remap the next batch
>> of memory with the hardware to continue along with the fake bytestream. This
>> is very much sticking a square peg in a round hole. If you were to "relax" the
>> rsocket implementation to register the entire VM memory space (as my
>> original implementation does), then there wouldn't be any need for rsocket in
>> the first place.
>>
> Thank you for your opinion. You're right. RSocket has encountered difficulties in
> transferring large amounts of data. We haven't even figured it out yet. Although
> in this practice, we solved several problems with rsocket.
>
> In our practice, we need to quickly complete VM live migration and the downtime
> of live migration must be within 50 ms or less. Therefore, we use RDMA, which is
> an essential requirement. Next, I think we'll do it based on Qemu's native RDMA
> live migration solution. During this period, we really doubted whether RDMA live
> migration was really feasible through rsocket refactoring, so the refactoring plan
> was shelved.
>
>
> Regards,
> -Gonglei

OK, this is helpful. Thanks for the response.

So that means we do still have two consumers of the native libibverbs 
RDMA solution.

Comments are still welcome. Is there still a reason to pursue this line 
of work that I might be missing?

- Michael

Sean Hefty Sept. 27, 2024, 9:45 p.m. UTC | #22

> > > I have met with the team from IONOS about their testing on actual IB
> > > hardware here at KVM Forum today and the requirements are starting
> > > to make more sense to me. I didn't say much in our previous thread
> > > because I misunderstood the requirements, so let me try to explain
> > > and see if we're all on the same page. There appears to be a
> > > fundamental limitation here with rsocket, for which I don't see how it is
> possible to overcome.
> > >
> > > The basic problem is that rsocket is trying to present a stream
> > > abstraction, a concept that is fundamentally incompatible with RDMA.
> > > The whole point of using RDMA in the first place is to avoid using
> > > the CPU, and to do that, all of the memory (potentially hundreds of
> > > gigabytes) need to be registered with the hardware *in advance* (this is
> how the original implementation works).
> > >
> > > The need to fake a socket/bytestream abstraction eventually breaks
> > > down => There is a limit (a few GB) in rsocket (which the IONOS team
> > > previous reported in testing.... see that email), it appears that
> > > means that rsocket is only going to be able to map a certain limited
> > > amount of memory with the hardware until its internal "buffer" runs
> > > out before it can then unmap and remap the next batch of memory with
> > > the hardware to continue along with the fake bytestream. This is
> > > very much sticking a square peg in a round hole. If you were to
> > > "relax" the rsocket implementation to register the entire VM memory
> > > space (as my original implementation does), then there wouldn't be any
> need for rsocket in the first place.
> 
> Yes, some test like this can be helpful.
> 
> And thanks for the summary.  That's definitely helpful.
> 
> One question from my side (as someone knows nothing on RDMA/rsocket): is
> that "a few GBs" limitation a software guard?  Would it be possible that rsocket
> provide some option to allow user opt-in on setting that value, so that it might
> work for VM use case?  Would that consume similar resources v.s. the current
> QEMU impl but allows it to use rsockets with no perf regressions?

Rsockets emulates the streaming socket API.  The amount of memory dedicated to a single rsocket is controlled through a wmem_default configuration setting.  It is also configurable via rsetsockopt() SO_SNDBUF.  Both of those are similar to TCP settings.  The SW field used to store this value is 32 bits wide.

This internal buffer acts as a bounce buffer to convert the synchronous socket API calls into the asynchronous RDMA transfers.  Rsockets uses the CPU for data copies, but the transport is offloaded to the NIC, including kernel bypass.

Does your kernel allocate > 4 GBs of buffer space to an individual socket?
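
For illustration only, the sketch below uses the ordinary TCP analogue of this knob (plain setsockopt() with SO_SNDBUF from Python's socket module, not rsetsockopt()) and a simple check of what a 32-bit size field can hold.  The helper names are hypothetical, and the value the kernel actually grants depends on system limits, so no expected output is shown for it.

```python
import socket

GIB = 1 << 30
UINT32_MAX = (1 << 32) - 1  # largest value a 32-bit unsigned field can hold


def fits_in_u32(nbytes: int) -> bool:
    """True if a buffer size can be stored in a 32-bit unsigned field."""
    return 0 <= nbytes <= UINT32_MAX


def tcp_sndbuf_after_request(requested: int) -> int:
    """Ask the kernel for a TCP send buffer size and read back what it set.

    This is the TCP analogue of the rsocket setting; the kernel may clamp
    or adjust the requested value.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested)
        return s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)


if __name__ == "__main__":
    print(fits_in_u32(4 * GIB - 1))    # True: just under 4 GiB
    print(fits_in_u32(256 * GIB))      # False: a large VM's RAM does not fit
    print(tcp_sndbuf_after_request(1 << 20))
```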

Michael Galaxy Sept. 28, 2024, 5:52 p.m. UTC | #23

On 9/27/24 16:45, Sean Hefty wrote:
>>>> I have met with the team from IONOS about their testing on actual IB
>>>> hardware here at KVM Forum today and the requirements are starting
>>>> to make more sense to me. I didn't say much in our previous thread
>>>> because I misunderstood the requirements, so let me try to explain
>>>> and see if we're all on the same page. There appears to be a
>>>> fundamental limitation here with rsocket, for which I don't see how it is
>> possible to overcome.
>>>> The basic problem is that rsocket is trying to present a stream
>>>> abstraction, a concept that is fundamentally incompatible with RDMA.
>>>> The whole point of using RDMA in the first place is to avoid using
>>>> the CPU, and to do that, all of the memory (potentially hundreds of
>>>> gigabytes) need to be registered with the hardware *in advance* (this is
>> how the original implementation works).
>>>> The need to fake a socket/bytestream abstraction eventually breaks
>>>> down => There is a limit (a few GB) in rsocket (which the IONOS team
>>>> previous reported in testing.... see that email), it appears that
>>>> means that rsocket is only going to be able to map a certain limited
>>>> amount of memory with the hardware until its internal "buffer" runs
>>>> out before it can then unmap and remap the next batch of memory with
>>>> the hardware to continue along with the fake bytestream. This is
>>>> very much sticking a square peg in a round hole. If you were to
>>>> "relax" the rsocket implementation to register the entire VM memory
>>>> space (as my original implementation does), then there wouldn't be any
>> need for rsocket in the first place.
>>
>> Yes, some test like this can be helpful.
>>
>> And thanks for the summary.  That's definitely helpful.
>>
>> One question from my side (as someone knows nothing on RDMA/rsocket): is
>> that "a few GBs" limitation a software guard?  Would it be possible that rsocket
>> provide some option to allow user opt-in on setting that value, so that it might
>> work for VM use case?  Would that consume similar resources v.s. the current
>> QEMU impl but allows it to use rsockets with no perf regressions?
> Rsockets is emulated the streaming socket API.  The amount of memory dedicated to a single rsocket is controlled through a wmem_default configuration setting.  It is also configurable via rsetsockopt() SO_SNDBUF.  Both of those are similar to TCP settings.  The SW field used to store this value is 32-bits.
>
> This internal buffer acts as a bounce buffer to convert the synchronous socket API calls into the asynchronous RDMA transfers.  Rsockets uses the CPU for data copies, but the transport is offloaded to the NIC, including kernel bypass.
Understood.
> Does your kernel allocate > 4 GBs of buffer space to an individual socket?
Yes, it absolutely does. We're dealing with virtual machines here, 
right? It is possible (and likely) to have a virtual machine that is 
hundreds of GBs of RAM in size.

A bounce buffer defeats the entire purpose of using RDMA in these cases. 
When using RDMA for very large transfers like this, the goal here is to 
map the entire memory region at once and avoid all CPU interactions 
(except for message management within libibverbs) so that the NIC is 
doing all of the work.

I'm sure rsocket has its place with much smaller transfer sizes, but 
this is very different.

- Michael

Michael S. Tsirkin Sept. 29, 2024, 6:14 p.m. UTC | #24

On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
> A bounce buffer defeats the entire purpose of using RDMA in these cases.
> When using RDMA for very large transfers like this, the goal here is to map
> the entire memory region at once and avoid all CPU interactions (except for
> message management within libibverbs) so that the NIC is doing all of the
> work.
> 
> I'm sure rsocket has its place with much smaller transfer sizes, but this is
> very different.

To clarify, are you actively using rdma based migration in production? Stepping up
to help maintain it?

Michael Galaxy Sept. 29, 2024, 8:26 p.m. UTC | #25

On 9/29/24 13:14, Michael S. Tsirkin wrote:
> On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
>> A bounce buffer defeats the entire purpose of using RDMA in these cases.
>> When using RDMA for very large transfers like this, the goal here is to map
>> the entire memory region at once and avoid all CPU interactions (except for
>> message management within libibverbs) so that the NIC is doing all of the
>> work.
>>
>> I'm sure rsocket has its place with much smaller transfer sizes, but this is
>> very different.
> To clarify, are you actively using rdma based migration in production? Stepping up
> to help maintain it?
>
Yes, both Huawei and IONOS have been contributing here in this 
email thread.

They are both using it in production.

- Michael

Michael S. Tsirkin Sept. 29, 2024, 10:26 p.m. UTC | #26

On Sun, Sep 29, 2024 at 03:26:58PM -0500, Michael Galaxy wrote:
> 
> On 9/29/24 13:14, Michael S. Tsirkin wrote:
> > On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
> > > A bounce buffer defeats the entire purpose of using RDMA in these cases.
> > > When using RDMA for very large transfers like this, the goal here is to map
> > > the entire memory region at once and avoid all CPU interactions (except for
> > > message management within libibverbs) so that the NIC is doing all of the
> > > work.
> > > 
> > > I'm sure rsocket has its place with much smaller transfer sizes, but this is
> > > very different.
> > To clarify, are you actively using rdma based migration in production? Stepping up
> > to help maintain it?
> > 
> Yes, both Huawei and IONOS have both been contributing here in this email
> thread.
> 
> They are both using it in production.
> 
> - Michael

Well, any plans to work on it? For example, postcopy does not really
do zero copy last time I checked, and there's also a long TODO list.

Michael Galaxy Sept. 30, 2024, 3 p.m. UTC | #27

On 9/29/24 17:26, Michael S. Tsirkin wrote:
> On Sun, Sep 29, 2024 at 03:26:58PM -0500, Michael Galaxy wrote:
>> On 9/29/24 13:14, Michael S. Tsirkin wrote:
>>> On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
>>>> A bounce buffer defeats the entire purpose of using RDMA in these cases.
>>>> When using RDMA for very large transfers like this, the goal here is to map
>>>> the entire memory region at once and avoid all CPU interactions (except for
>>>> message management within libibverbs) so that the NIC is doing all of the
>>>> work.
>>>>
>>>> I'm sure rsocket has its place with much smaller transfer sizes, but this is
>>>> very different.
>>> To clarify, are you actively using rdma based migration in production? Stepping up
>>> to help maintain it?
>>>
>> Yes, both Huawei and IONOS have both been contributing here in this email
>> thread.
>>
>> They are both using it in production.
>>
>> - Michael
> Well, any plans to work on it? for example, postcopy does not really
> do zero copy last time I checked, there's also a long TODO list.
>
I apologize, I'm not following the question here. Isn't that what this 
thread is about?

So, some background is missing here, perhaps: A few months ago, there 
was a proposal to remove native RDMA support from live migration due to 
concerns about lack of testability. Both IONOS and Huawei have stepped 
up to say that they are using it and are engaging with the community 
here. I also proposed transferring maintainership over to them. (I no 
longer have any of this hardware, so I cannot provide testing support 
anymore.)

During that time, rsocket was proposed as an alternative, but as I have 
laid out above, I believe it cannot work for technical reasons.

I also asked earlier in the thread if we can cover the community's 
testing concerns using softRoCE, so that an integration test can be made 
to work (presumably through avocado or something similar).

Does that history make sense?

- Michael

Yu Zhang Sept. 30, 2024, 3:31 p.m. UTC | #28

Hello Michael,

That's true. To my understanding, to ease the maintenance, Gonglei's
team has made an effort to refactor the RDMA migration code by using
rsocket. However, due to a certain limitation in rsocket, it turned
out that only small VMs (in terms of core count and memory) can be
migrated successfully. As long as this limitation persists, no
progress can be achieved in this direction. On the other hand, a
proper test environment and integration / regression test cases are
expected to catch any possible regression due to new changes. It seems
that currently, we can go in this direction.

Best regards,
Yu Zhang @ IONOS cloud

On Mon, Sep 30, 2024 at 5:00 PM Michael Galaxy <mgalaxy@akamai.com> wrote:
>
>
> On 9/29/24 17:26, Michael S. Tsirkin wrote:
> > On Sun, Sep 29, 2024 at 03:26:58PM -0500, Michael Galaxy wrote:
> >> On 9/29/24 13:14, Michael S. Tsirkin wrote:
> >>> On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
> >>>> A bounce buffer defeats the entire purpose of using RDMA in these cases.
> >>>> When using RDMA for very large transfers like this, the goal here is to map
> >>>> the entire memory region at once and avoid all CPU interactions (except for
> >>>> message management within libibverbs) so that the NIC is doing all of the
> >>>> work.
> >>>>
> >>>> I'm sure rsocket has its place with much smaller transfer sizes, but this is
> >>>> very different.
> >>> To clarify, are you actively using rdma based migration in production? Stepping up
> >>> to help maintain it?
> >>>
> >> Yes, both Huawei and IONOS have both been contributing here in this email
> >> thread.
> >>
> >> They are both using it in production.
> >>
> >> - Michael
> > Well, any plans to work on it? for example, postcopy does not really
> > do zero copy last time I checked, there's also a long TODO list.
> >
> I apologize, I'm not following the question here. Isn't that what this
> thread is about?
>
> So, some background is missing here, perhaps: A few months ago, there
> was a proposal
> to remove native RDMA support from live migration due to concerns about
> lack of testability.
> Both IONOS and Huawei have stepped up that they are using it and are
> engaging with the
> community here. I also proposed transferring over maintainership to them
> as well.  (I  no longer
> have any of this hardware, so I cannot provide testing support anymore).
>
> During that time, rsocket was proposed as an alternative, but as I have
> laid out above, I believe
> it cannot work for technical reasons.
>
> I also asked earlier in the thread if we can cover the community's
> testing concerns using softroce,
> so that an integration test can be made to work (presumably through
> avocado or something similar).
>
> Does that history make sense?
>
> - Michael
>

Peter Xu Sept. 30, 2024, 6:16 p.m. UTC | #29

On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
> On 9/27/24 16:45, Sean Hefty wrote:
> > > > > I have met with the team from IONOS about their testing on actual IB
> > > > > hardware here at KVM Forum today and the requirements are starting
> > > > > to make more sense to me. I didn't say much in our previous thread
> > > > > because I misunderstood the requirements, so let me try to explain
> > > > > and see if we're all on the same page. There appears to be a
> > > > > fundamental limitation here with rsocket, and I don't see how it is
> > > > > possible to overcome it.
> > > > > The basic problem is that rsocket is trying to present a stream
> > > > > abstraction, a concept that is fundamentally incompatible with RDMA.
> > > > > The whole point of using RDMA in the first place is to avoid using
> > > > > the CPU, and to do that, all of the memory (potentially hundreds of
> > > > > gigabytes) needs to be registered with the hardware *in advance* (this
> > > > > is how the original implementation works).
> > > > > The need to fake a socket/bytestream abstraction eventually breaks
> > > > > down: there is a limit (a few GB) in rsocket, which the IONOS team
> > > > > previously reported in testing (see that email). It appears that
> > > > > this means rsocket can only map a limited amount of memory with the
> > > > > hardware until its internal "buffer" runs out; it must then unmap
> > > > > and remap the next batch of memory with the hardware to continue
> > > > > along with the fake bytestream. This is
> > > > > very much sticking a square peg in a round hole. If you were to
> > > > > "relax" the rsocket implementation to register the entire VM memory
> > > > > space (as my original implementation does), then there wouldn't be any
> > > > > need for rsocket in the first place.
> > > 
> > > Yes, some test like this can be helpful.
> > > 
> > > And thanks for the summary.  That's definitely helpful.
> > > 
> > > One question from my side (as someone knows nothing on RDMA/rsocket): is
> > > that "a few GBs" limitation a software guard?  Would it be possible that rsocket
> > > provide some option to allow user opt-in on setting that value, so that it might
> > > work for VM use case?  Would that consume similar resources v.s. the current
> > > QEMU impl but allows it to use rsockets with no perf regressions?
> > Rsockets emulates the streaming socket API.  The amount of memory dedicated to a single rsocket is controlled through a wmem_default configuration setting.  It is also configurable via rsetsockopt() SO_SNDBUF.  Both of those are similar to TCP settings.  The SW field used to store this value is 32 bits.
> > 
> > This internal buffer acts as a bounce buffer to convert the synchronous socket API calls into the asynchronous RDMA transfers.  Rsockets uses the CPU for data copies, but the transport is offloaded to the NIC, including kernel bypass.
> Understood.
> > Does your kernel allocate > 4 GBs of buffer space to an individual socket?
> Yes, it absolutely does. We're dealing with virtual machines here, right? It
> is possible (and likely) to have a virtual machine that is hundreds of GBs
> of RAM in size.
> 
> A bounce buffer defeats the entire purpose of using RDMA in these cases.
> When using RDMA for very large transfers like this, the goal here is to map
> the entire memory region at once and avoid all CPU interactions (except for
> message management within libibverbs) so that the NIC is doing all of the
> work.
> 
> I'm sure rsocket has its place with much smaller transfer sizes, but this is
> very different.

Is it possible to make rsocket be friendly with large buffers (>4GB) like
the VM use case?

I also wonder whether there're other applications that may benefit from
this outside of QEMU.

Thanks,
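
The size constraint discussed above can be sketched as plain arithmetic. This
is only an illustration: the guest size and buffer size are hypothetical, the
helper names are made up, and the 32-bit field is assumed to be unsigned (the
thread only says it is 32 bits wide).

```python
# Illustrative arithmetic only; sizes and names below are assumptions.
GIB = 1 << 30
U32_MAX = (1 << 32) - 1  # largest value an unsigned 32-bit field can hold


def fits_in_u32(nbytes: int) -> bool:
    """Return True if a buffer size can be stored in an unsigned 32-bit field."""
    return 0 <= nbytes <= U32_MAX


def buffer_refills(total_bytes: int, buffer_bytes: int) -> int:
    """How many times a bounce buffer must be filled to stream total_bytes."""
    if buffer_bytes <= 0:
        raise ValueError("buffer_bytes must be positive")
    return -(-total_bytes // buffer_bytes)  # ceiling division


vm_ram = 256 * GIB  # hypothetical guest RAM size
buf = 2 * GIB       # hypothetical per-socket buffer size

print("guest RAM fits in a 32-bit size field:", fits_in_u32(vm_ram))
print("buffer refills needed:", buffer_refills(vm_ram, buf))
```

Under these assumptions the guest RAM cannot be described by a single buffer
size, so the data has to pass through the buffer in many rounds.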

Sean Hefty Sept. 30, 2024, 7:20 p.m. UTC | #30

> > I'm sure rsocket has its place with much smaller transfer sizes, but
> > this is very different.
> 
> Is it possible to make rsocket be friendly with large buffers (>4GB) like the VM
> use case?

If you can perform large VM migrations using streaming sockets, rsockets is likely usable, but it will involve data copies.  The problem is the socket API semantics.

There are rsocket API extensions (riowrite, riomap) to support RDMA write operations.  This avoids the data copy at the target, but not the sender.   (riowrite follows the socket send semantics on buffer ownership.)

It may be possible to enhance rsockets with MSG_ZEROCOPY or io_uring extensions to enable zero-copy for large transfers, but that's not something I've looked at.  True zero copy may require combining MSG_ZEROCOPY with riowrite, but then that moves further away from using traditional socket calls.

- Sean
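
The copy behaviour described in this message can be summarised in a small
model. It is a simplified sketch, not a measurement: the path names and the
per-path copy counts are a reading of the statements in this thread (bounce
buffer on both sides for plain rsocket calls, sender-side copy only with
riomap/riowrite, no CPU copy when all memory is registered in advance), and
the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferPath:
    """One way of moving guest memory, with CPU copies per payload byte."""
    name: str
    sender_copies: int
    receiver_copies: int


# Copy counts are assumptions derived from the discussion, not benchmarks.
PATHS = [
    TransferPath("rsocket send/recv via bounce buffer", 1, 1),
    TransferPath("rsocket riomap/riowrite", 1, 0),
    TransferPath("native RDMA, memory registered in advance", 0, 0),
]


def cpu_bytes_copied(path: TransferPath, payload_bytes: int) -> int:
    """Total bytes the CPU copies on both hosts for one pass over the payload."""
    return (path.sender_copies + path.receiver_copies) * payload_bytes


payload = 100 * (1 << 30)  # hypothetical 100 GiB of guest RAM
for path in PATHS:
    gib = cpu_bytes_copied(path, payload) // (1 << 30)
    print(f"{path.name}: {gib} GiB copied by the CPU")
```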

Peter Xu Sept. 30, 2024, 7:47 p.m. UTC | #31

On Mon, Sep 30, 2024 at 07:20:56PM +0000, Sean Hefty wrote:
> > > I'm sure rsocket has its place with much smaller transfer sizes, but
> > > this is very different.
> > 
> > Is it possible to make rsocket be friendly with large buffers (>4GB) like the VM
> > use case?
> 
> If you can perform large VM migrations using streaming sockets, rsockets is likely usable, but it will involve data copies.  The problem is the socket API semantics.
> 
> There are rsocket API extensions (riowrite, riomap) to support RDMA write operations.  This avoids the data copy at the target, but not the sender.   (riowrite follows the socket send semantics on buffer ownership.)
> 
> It may be possible to enhance rsockets with MSG_ZEROCOPY or io_uring extensions to enable zero-copy for large transfers, but that's not something I've looked at.  True zero copy may require combining MSG_ZEROCOPY with riowrite, but then that moves further away from using traditional socket calls.

Thanks, Sean.

One thing to mention is that QEMU has QIO_CHANNEL_WRITE_FLAG_ZERO_COPY,
which already supports MSG_ZEROCOPY, but only on the sender side and only
when multifd is enabled, because it requires page pinning and alignment,
and it's more challenging to pin a random buffer than a guest page.

Nobody has moved on zerocopy recv for TCP yet; there might be similar
challenges, in that normal socket APIs may not work easily on top of the
current iochannel design, but I don't know enough to say.

I'm not sure whether this means there can be a shared goal, with QEMU
ultimately supporting better zerocopy via either TCP or RDMA.  If that's
true, maybe there's a chance we can move towards rsocket with all the above
facilities, while RDMA can, ideally, run similarly to TCP with the same (to
be enhanced) iochannel API, so that it can do zerocopy on both sides with
either transport.

Michael Galaxy Oct. 3, 2024, 9:26 p.m. UTC | #32

On 9/30/24 14:47, Peter Xu wrote:
> On Mon, Sep 30, 2024 at 07:20:56PM +0000, Sean Hefty wrote:
>>>> I'm sure rsocket has its place with much smaller transfer sizes, but
>>>> this is very different.
>>> Is it possible to make rsocket be friendly with large buffers (>4GB) like the VM
>>> use case?
>> If you can perform large VM migrations using streaming sockets, rsockets is likely usable, but it will involve data copies.  The problem is the socket API semantics.
>>
>> There are rsocket API extensions (riowrite, riomap) to support RDMA write operations.  This avoids the data copy at the target, but not the sender.   (riowrite follows the socket send semantics on buffer ownership.)
>>
>> It may be possible to enhance rsockets with MSG_ZEROCOPY or io_uring extensions to enable zero-copy for large transfers, but that's not something I've looked at.  True zero copy may require combining MSG_ZEROCOPY with riowrite, but then that moves further away from using traditional socket calls.
> Thanks, Sean.
>
> One thing to mention is that QEMU has QIO_CHANNEL_WRITE_FLAG_ZERO_COPY,
> which already supports MSG_ZEROCOPY, but only on the sender side and only
> when multifd is enabled, because it requires page pinning and alignment,
> and it's more challenging to pin a random buffer than a guest page.
>
> Nobody has moved on zerocopy recv for TCP yet; there might be similar
> challenges, in that normal socket APIs may not work easily on top of the
> current iochannel design, but I don't know enough to say.
>
> I'm not sure whether this means there can be a shared goal, with QEMU
> ultimately supporting better zerocopy via either TCP or RDMA.  If that's
> true, maybe there's a chance we can move towards rsocket with all the above
> facilities, while RDMA can, ideally, run similarly to TCP with the same (to
> be enhanced) iochannel API, so that it can do zerocopy on both sides with
> either transport.
>
What about the testing solution that I mentioned?

Does that satisfy your concerns? Or is there still a gap here that needs 
to be met?

- Michael

Peter Xu Oct. 3, 2024, 9:43 p.m. UTC | #33

On Thu, Oct 03, 2024 at 04:26:27PM -0500, Michael Galaxy wrote:
> What about the testing solution that I mentioned?
> 
> Does that satisfy your concerns? Or is there still a gap here that needs to
> be met?

I think such a testing framework would be helpful, especially if we can kick
it off in CI when preparing pull requests, so that we can make sure nothing
breaks RDMA easily.

Meanwhile, we still need people who know the RDMA code well to commit to
this and actively maintain it.

Thanks,

Michael Galaxy Oct. 4, 2024, 2:04 p.m. UTC | #34

On 10/3/24 16:43, Peter Xu wrote:
> On Thu, Oct 03, 2024 at 04:26:27PM -0500, Michael Galaxy wrote:
>> What about the testing solution that I mentioned?
>>
>> Does that satisfy your concerns? Or is there still a gap here that needs to
>> be met?
> I think such a testing framework would be helpful, especially if we can kick
> it off in CI when preparing pull requests, so that we can make sure nothing
> breaks RDMA easily.
>
> Meanwhile, we still need people who know the RDMA code well to commit to
> this and actively maintain it.
>
> Thanks,
>

OK, so comments from Yu Zhang and Gonglei? Can we work up a CI test 
along these lines that would ensure that future RDMA breakages are 
detected more easily?

What do you think?

- Michael

Yu Zhang Oct. 7, 2024, 8:47 a.m. UTC | #35

Sure, as we discussed at KVM Forum, a possible approach is to set up
two VMs on a physical host, configure SoftRoCE in them, and run the
migration test with two nested VMs to ensure that the migration data
traffic goes through the emulated RDMA hardware. I will continue with
this and let you know.
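
A minimal sketch of the commands such a setup might involve, written as a
dry-run helper that only builds and prints them. The interface name, RXE
device name, address and port are placeholders, and the helper functions are
hypothetical; the sketch assumes the iproute2 "rdma" tool, the rdma_rxe
kernel module, and QEMU's existing "rdma:host:port" migration URI.

```python
import shlex


def softroce_setup_commands(netdev: str, rxe_name: str = "rxe0") -> list:
    """Commands to attach a SoftRoCE (RXE) device to netdev; run as root."""
    return [
        ["modprobe", "rdma_rxe"],
        ["rdma", "link", "add", rxe_name, "type", "rxe", "netdev", netdev],
        ["rdma", "link", "show"],
    ]


def migration_uri(host: str, port: int) -> str:
    """Build the URI used with '-incoming' on the destination and
    'migrate -d' on the source."""
    return f"rdma:{host}:{port}"


# Dry run: print what would be executed in each outer VM.
for cmd in softroce_setup_commands("eth0"):  # "eth0" is a placeholder
    print(shlex.join(cmd))

uri = migration_uri("192.0.2.10", 4444)  # placeholder address and port
print("destination: qemu-system-x86_64 ... -incoming", uri)
print("source (HMP): migrate -d", uri)
```

Printing instead of executing keeps the sketch safe to run anywhere; a real
CI job would run the commands and then check the migration status.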


Michael Galaxy Oct. 7, 2024, 1:45 p.m. UTC | #36

Hi,

On 10/7/24 03:47, Yu Zhang wrote:
> Sure, as we discussed at KVM Forum, a possible approach is to set up
> two VMs on a physical host, configure SoftRoCE in them, and run the
> migration test with two nested VMs to ensure that the migration data
> traffic goes through the emulated RDMA hardware. I will continue with
> this and let you know.
>
Acknowledged. Do share if you have any problems with it, like if it has
compatibility issues or if we need a different solution. We're open to change.

I'm not familiar with the "current state" of this or how well it would even
work.

- Michael


Leon Romanovsky Oct. 7, 2024, 6:15 p.m. UTC | #37

On Mon, Oct 07, 2024 at 08:45:07AM -0500, Michael Galaxy wrote:
> Hi,
> 
> On 10/7/24 03:47, Yu Zhang wrote:
> > Sure, as we discussed at KVM Forum, a possible approach is to set up
> > two VMs on a physical host, configure SoftRoCE in them, and run the
> > migration test with two nested VMs to ensure that the migration data
> > traffic goes through the emulated RDMA hardware. I will continue with
> > this and let you know.
> > 
> Acknowledged. Do share if you have any problems with it, like if it has
> compatibility issues
> or if we need a different solution. We're open to change.
> 
> I'm not familiar with the "current state" of this or how well it would even
> work.

Any compatibility issue between versions of RXE (SoftRoCE) or between
RXE and real devices is a bug in RXE, which should be fixed.

RXE is expected to be compatible with all other RoCE devices, both virtual
and physical.

Thanks

Zhu Yanjun Oct. 8, 2024, 9:31 a.m. UTC | #38

On 2024/10/8 2:15, Leon Romanovsky wrote:
> On Mon, Oct 07, 2024 at 08:45:07AM -0500, Michael Galaxy wrote:
>> Hi,
>>
>> On 10/7/24 03:47, Yu Zhang wrote:
>>> Sure, as we discussed at KVM Forum, a possible approach is to set up
>>> two VMs on a physical host, configure SoftRoCE in them, and run the
>>> migration test with two nested VMs to ensure that the migration data
>>> traffic goes through the emulated RDMA hardware. I will continue with
>>> this and let you know.
>>>
>> Acknowledged. Do share if you have any problems with it, like if it has
>> compatibility issues
>> or if we need a different solution. We're open to change.
>>
>> I'm not familiar with the "current state" of this or how well it would even
>> work.
> 
> Any compatibility issue between versions of RXE (SoftRoCE) or between
> RXE and real devices is a bug in RXE, which should be fixed.
> 
> RXE is expected to be compatible with all other RoCE devices, both virtual
> and physical.

From my tests with physical RoCE devices, for example Nvidia MLX5 and
Intel E810 (iRDMA): if the RDMA feature is disabled on those devices,
RXE works well with them.

As for virtual devices, most of them work well with RXE, for example
bonding and veth. I have done a lot of tests with them.

If some virtual device does not work well with RXE, please share the
error messages on the RDMA mailing list.

Zhu Yanjun

Michael Galaxy Oct. 23, 2024, 1:42 p.m. UTC | #39

Hi All,

This is just a heads up: I will be changing employment soon, so my
Akamai email address will stop working this week.

My personal email: michael@flatgalaxy.com. I'll re-subscribe once I am
back online for work.

Thanks!

- Michael
