mbox series

[RFC,0/7] migration/multifd: Introduce storage slots

Message ID 20240620212111.29319-1-farosas@suse.de (mailing list archive)
Headers show
Series migration/multifd: Introduce storage slots | expand

Message

Fabiano Rosas June 20, 2024, 9:21 p.m. UTC
Hi folks,

First of all, apologies for the roughness of the series. I'm off for
the next couple of weeks and wanted to put something together early
for your consideration.

This series is a refactoring (based on an earlier, off-list
attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
the multifd core. If we're going to add support for more data types to
multifd, we first need to clean that up.

This time around this work was prompted by Maciej's series[1]. I see
you're having to add a bunch of is_device_state checks to work around
the rigidity of the code.

Aside from the VFIO work, there is also the intent (coming back from
Juan's ideas) to make multifd the default code path for migration,
which will have to include the vmstate migration and anything else we
put on the stream via QEMUFile.

I have long since been bothered by having 'pages' sprinkled all over
the code, so I might be coming at this with a bit of a narrow focus,
but I believe in order to support more types of payloads in multifd,
we need to first allow the scheduling at multifd_send_pages() to be
independent of MultiFDPages_t. So here it is. Let me know what you
think.

(as I said, I'll be off for a couple of weeks, so feel free to
incorporate any of this code if it's useful. Or to ignore it
completely).

CI run: https://gitlab.com/farosas/qemu/-/pipelines/1340992028

0- https://github.com/farosas/qemu/commits/multifd-packet-cleanups/
1- https://lore.kernel.org/r/cover.1718717584.git.maciej.szmigiero@oracle.com

Fabiano Rosas (7):
  migration/multifd: Reduce access to p->pages
  migration/multifd: Pass in MultiFDPages_t to file_write_ramblock_iov
  migration/multifd: Replace p->pages with an opaque pointer
  migration/multifd: Move pages accounting into
    multifd_send_zero_page_detect()
  migration/multifd: Isolate ram pages packet data
  migration/multifd: Move payload storage out of the channel parameters
  migration/multifd: Hide multifd slots implementation

 migration/file.c              |   3 +-
 migration/file.h              |   2 +-
 migration/multifd-qpl.c       |   8 +-
 migration/multifd-uadk.c      |   9 +-
 migration/multifd-zero-page.c |   6 +-
 migration/multifd-zlib.c      |   4 +-
 migration/multifd-zstd.c      |   4 +-
 migration/multifd.c           | 263 ++++++++++++++++++++++++----------
 migration/multifd.h           |  28 +++-
 migration/ram.c               |   1 +
 10 files changed, 232 insertions(+), 96 deletions(-)


base-commit: 79e6ec66ba1067a135394a330fec14b50cf49534

Comments

Maciej S. Szmigiero June 21, 2024, 2:44 p.m. UTC | #1
Hi Fabiano,

On 20.06.2024 23:21, Fabiano Rosas wrote:
> Hi folks,
> 
> First of all, apologies for the roughness of the series. I'm off for
> the next couple of weeks and wanted to put something together early
> for your consideration.
> 
> This series is a refactoring (based on an earlier, off-list
> attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
> the multifd core. If we're going to add support for more data types to
> multifd, we first need to clean that up.
> 
> This time around this work was prompted by Maciej's series[1]. I see
> you're having to add a bunch of is_device_state checks to work around
> the rigidity of the code.
> 
> Aside from the VFIO work, there is also the intent (coming back from
> Juan's ideas) to make multifd the default code path for migration,
> which will have to include the vmstate migration and anything else we
> put on the stream via QEMUFile.
> 
> I have long since been bothered by having 'pages' sprinkled all over
> the code, so I might be coming at this with a bit of a narrow focus,
> but I believe in order to support more types of payloads in multifd,
> we need to first allow the scheduling at multifd_send_pages() to be
> independent of MultiFDPages_t. So here it is. Let me know what you
> think.

Thanks for the patch set, I quickly glanced at these patches and they
definitely make sense to me.

I guess its latest version could be found in the repo at [2] since
that's where the CI run mentioned below took it from?

> (as I said, I'll be off for a couple of weeks, so feel free to
> incorporate any of this code if it's useful. Or to ignore it
> completely).

I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has
feature freeze in about a month, correct?

> CI run: https://gitlab.com/farosas/qemu/-/pipelines/1340992028
> 
> 0- https://github.com/farosas/qemu/commits/multifd-packet-cleanups/
> 1- https://lore.kernel.org/r/cover.1718717584.git.maciej.szmigiero@oracle.com

[2]: https://gitlab.com/farosas/qemu/-/commits/multifd-pages-decouple

Thanks,
Maciej
Fabiano Rosas June 21, 2024, 3:04 p.m. UTC | #2
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:

> Hi Fabiano,
>
> On 20.06.2024 23:21, Fabiano Rosas wrote:
>> Hi folks,
>> 
>> First of all, apologies for the roughness of the series. I'm off for
>> the next couple of weeks and wanted to put something together early
>> for your consideration.
>> 
>> This series is a refactoring (based on an earlier, off-list
>> attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
>> the multifd core. If we're going to add support for more data types to
>> multifd, we first need to clean that up.
>> 
>> This time around this work was prompted by Maciej's series[1]. I see
>> you're having to add a bunch of is_device_state checks to work around
>> the rigidity of the code.
>> 
>> Aside from the VFIO work, there is also the intent (coming back from
>> Juan's ideas) to make multifd the default code path for migration,
>> which will have to include the vmstate migration and anything else we
>> put on the stream via QEMUFile.
>> 
>> I have long since been bothered by having 'pages' sprinkled all over
>> the code, so I might be coming at this with a bit of a narrow focus,
>> but I believe in order to support more types of payloads in multifd,
>> we need to first allow the scheduling at multifd_send_pages() to be
>> independent of MultiFDPages_t. So here it is. Let me know what you
>> think.
>
> Thanks for the patch set, I quickly glanced at these patches and they
> definitely make sense to me.
>
> I guess its latest version could be found in the repo at [2] since
> that's where the CI run mentioned below took it from?

Yes

>
>> (as I said, I'll be off for a couple of weeks, so feel free to
>> incorporate any of this code if it's useful. Or to ignore it
>> completely).
>
> I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has
> feature freeze in about a month, correct?
>

For general code improvements like this I'm not thinking about QEMU
releases at all. But this series is not super complex, so I could
imagine we merging it in time for 9.1 if we reach an agreement.

Are you thinking your series might miss the target? Or have concerns
over the stability of the refactoring? We can within reason merge code
based on the current framework and improve things on top, we already did
something similar when merging zero-page support. I don't have an issue
with that.


>> CI run: https://gitlab.com/farosas/qemu/-/pipelines/1340992028
>> 
>> 0- https://github.com/farosas/qemu/commits/multifd-packet-cleanups/
>> 1- https://lore.kernel.org/r/cover.1718717584.git.maciej.szmigiero@oracle.com
>
> [2]: https://gitlab.com/farosas/qemu/-/commits/multifd-pages-decouple
>
> Thanks,
> Maciej
Maciej S. Szmigiero June 21, 2024, 3:31 p.m. UTC | #3
On 21.06.2024 17:04, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> 
>> Hi Fabiano,
>>
>> On 20.06.2024 23:21, Fabiano Rosas wrote:
>>> Hi folks,
>>>
>>> First of all, apologies for the roughness of the series. I'm off for
>>> the next couple of weeks and wanted to put something together early
>>> for your consideration.
>>>
>>> This series is a refactoring (based on an earlier, off-list
>>> attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
>>> the multifd core. If we're going to add support for more data types to
>>> multifd, we first need to clean that up.
>>>
>>> This time around this work was prompted by Maciej's series[1]. I see
>>> you're having to add a bunch of is_device_state checks to work around
>>> the rigidity of the code.
>>>
>>> Aside from the VFIO work, there is also the intent (coming back from
>>> Juan's ideas) to make multifd the default code path for migration,
>>> which will have to include the vmstate migration and anything else we
>>> put on the stream via QEMUFile.
>>>
>>> I have long since been bothered by having 'pages' sprinkled all over
>>> the code, so I might be coming at this with a bit of a narrow focus,
>>> but I believe in order to support more types of payloads in multifd,
>>> we need to first allow the scheduling at multifd_send_pages() to be
>>> independent of MultiFDPages_t. So here it is. Let me know what you
>>> think.
>>
>> Thanks for the patch set, I quickly glanced at these patches and they
>> definitely make sense to me.
>>
(..)
>>> (as I said, I'll be off for a couple of weeks, so feel free to
>>> incorporate any of this code if it's useful. Or to ignore it
>>> completely).
>>
>> I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has
>> feature freeze in about a month, correct?
>>
> 
> For general code improvements like this I'm not thinking about QEMU
> releases at all. But this series is not super complex, so I could
> imagine we merging it in time for 9.1 if we reach an agreement.
> 
> Are you thinking your series might miss the target? Or have concerns
> over the stability of the refactoring? We can within reason merge code
> based on the current framework and improve things on top, we already did
> something similar when merging zero-page support. I don't have an issue
> with that.

The reason that I asked whether you are targeting 9.1 is because my
patch set is definitely targeting that release.

At the same time my patch set will need to be rebased/refactored on top
of this patch set if it is supposed to be merged for 9.1 too.

If this patch set gets merged quickly that's not really a problem.

On the other hand, if another iteration(s) is/are needed AND you are
not available in the coming weeks to work on them then there's a
question whether we will make the required deadline.

Thanks,
Maciej
Peter Xu June 21, 2024, 3:56 p.m. UTC | #4
On Fri, Jun 21, 2024 at 05:31:54PM +0200, Maciej S. Szmigiero wrote:
> On 21.06.2024 17:04, Fabiano Rosas wrote:
> > "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> > 
> > > Hi Fabiano,
> > > 
> > > On 20.06.2024 23:21, Fabiano Rosas wrote:
> > > > Hi folks,
> > > > 
> > > > First of all, apologies for the roughness of the series. I'm off for
> > > > the next couple of weeks and wanted to put something together early
> > > > for your consideration.
> > > > 
> > > > This series is a refactoring (based on an earlier, off-list
> > > > attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
> > > > the multifd core. If we're going to add support for more data types to
> > > > multifd, we first need to clean that up.
> > > > 
> > > > This time around this work was prompted by Maciej's series[1]. I see
> > > > you're having to add a bunch of is_device_state checks to work around
> > > > the rigidity of the code.
> > > > 
> > > > Aside from the VFIO work, there is also the intent (coming back from
> > > > Juan's ideas) to make multifd the default code path for migration,
> > > > which will have to include the vmstate migration and anything else we
> > > > put on the stream via QEMUFile.
> > > > 
> > > > I have long since been bothered by having 'pages' sprinkled all over
> > > > the code, so I might be coming at this with a bit of a narrow focus,
> > > > but I believe in order to support more types of payloads in multifd,
> > > > we need to first allow the scheduling at multifd_send_pages() to be
> > > > independent of MultiFDPages_t. So here it is. Let me know what you
> > > > think.
> > > 
> > > Thanks for the patch set, I quickly glanced at these patches and they
> > > definitely make sense to me.
> > > 
> (..)
> > > > (as I said, I'll be off for a couple of weeks, so feel free to
> > > > incorporate any of this code if it's useful. Or to ignore it
> > > > completely).
> > > 
> > > I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has
> > > feature freeze in about a month, correct?
> > > 
> > 
> > For general code improvements like this I'm not thinking about QEMU
> > releases at all. But this series is not super complex, so I could
> > imagine we merging it in time for 9.1 if we reach an agreement.
> > 
> > Are you thinking your series might miss the target? Or have concerns
> > over the stability of the refactoring? We can within reason merge code
> > based on the current framework and improve things on top, we already did
> > something similar when merging zero-page support. I don't have an issue
> > with that.
> 
> The reason that I asked whether you are targeting 9.1 is because my
> patch set is definitely targeting that release.
> 
> At the same time my patch set will need to be rebased/refactored on top
> of this patch set if it is supposed to be merged for 9.1 too.
> 
> If this patch set gets merged quickly that's not really a problem.
> 
> On the other hand, if another iteration(s) is/are needed AND you are
> not available in the coming weeks to work on them then there's a
> question whether we will make the required deadline.

I think it's a bit rush to merge the vfio series in this release.  I'm not
sure it has enough time to be properly reviewed, reposted, retested, etc.

I've already started looking at it, and so far I think I have doubt not
only on agreement with Fabiano on the device_state thing which I prefer to
avoid, but also I'm thinking of any possible way to at least make the
worker threads generic too: a direct impact could be vDPA in the near
future if anyone cared, while I don't want modules to create threads
randomly during migration.

Meanwhile I'm also thinking whether that "the thread needs to dump all
data, and during iteration we can't do that" is the good reason to not
support that during iterations.

I didn't yet reply because I don't think I think all things through, but
I'll get there.

So I'm not saying that the design is problematic, but IMHO it's just not
mature enough to assume it will land in 9.1, considering it's still a large
one, and the first non-rfc version just posted two days ago.

Thanks,
Maciej S. Szmigiero June 21, 2024, 5:40 p.m. UTC | #5
On 21.06.2024 17:56, Peter Xu wrote:
> On Fri, Jun 21, 2024 at 05:31:54PM +0200, Maciej S. Szmigiero wrote:
>> On 21.06.2024 17:04, Fabiano Rosas wrote:
>>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>>
>>>> Hi Fabiano,
>>>>
>>>> On 20.06.2024 23:21, Fabiano Rosas wrote:
>>>>> Hi folks,
>>>>>
>>>>> First of all, apologies for the roughness of the series. I'm off for
>>>>> the next couple of weeks and wanted to put something together early
>>>>> for your consideration.
>>>>>
>>>>> This series is a refactoring (based on an earlier, off-list
>>>>> attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
>>>>> the multifd core. If we're going to add support for more data types to
>>>>> multifd, we first need to clean that up.
>>>>>
>>>>> This time around this work was prompted by Maciej's series[1]. I see
>>>>> you're having to add a bunch of is_device_state checks to work around
>>>>> the rigidity of the code.
>>>>>
>>>>> Aside from the VFIO work, there is also the intent (coming back from
>>>>> Juan's ideas) to make multifd the default code path for migration,
>>>>> which will have to include the vmstate migration and anything else we
>>>>> put on the stream via QEMUFile.
>>>>>
>>>>> I have long since been bothered by having 'pages' sprinkled all over
>>>>> the code, so I might be coming at this with a bit of a narrow focus,
>>>>> but I believe in order to support more types of payloads in multifd,
>>>>> we need to first allow the scheduling at multifd_send_pages() to be
>>>>> independent of MultiFDPages_t. So here it is. Let me know what you
>>>>> think.
>>>>
>>>> Thanks for the patch set, I quickly glanced at these patches and they
>>>> definitely make sense to me.
>>>>
>> (..)
>>>>> (as I said, I'll be off for a couple of weeks, so feel free to
>>>>> incorporate any of this code if it's useful. Or to ignore it
>>>>> completely).
>>>>
>>>> I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has
>>>> feature freeze in about a month, correct?
>>>>
>>>
>>> For general code improvements like this I'm not thinking about QEMU
>>> releases at all. But this series is not super complex, so I could
>>> imagine we merging it in time for 9.1 if we reach an agreement.
>>>
>>> Are you thinking your series might miss the target? Or have concerns
>>> over the stability of the refactoring? We can within reason merge code
>>> based on the current framework and improve things on top, we already did
>>> something similar when merging zero-page support. I don't have an issue
>>> with that.
>>
>> The reason that I asked whether you are targeting 9.1 is because my
>> patch set is definitely targeting that release.
>>
>> At the same time my patch set will need to be rebased/refactored on top
>> of this patch set if it is supposed to be merged for 9.1 too.
>>
>> If this patch set gets merged quickly that's not really a problem.
>>
>> On the other hand, if another iteration(s) is/are needed AND you are
>> not available in the coming weeks to work on them then there's a
>> question whether we will make the required deadline.
> 
> I think it's a bit rush to merge the vfio series in this release.  I'm not
> sure it has enough time to be properly reviewed, reposted, retested, etc.
> 
> I've already started looking at it, and so far I think I have doubt not
> only on agreement with Fabiano on the device_state thing which I prefer to
> avoid, but also I'm thinking of any possible way to at least make the
> worker threads generic too: a direct impact could be vDPA in the near
> future if anyone cared, while I don't want modules to create threads
> randomly during migration.
> 
> Meanwhile I'm also thinking whether that "the thread needs to dump all
> data, and during iteration we can't do that" is the good reason to not
> support that during iterations.
> 
> I didn't yet reply because I don't think I think all things through, but
> I'll get there.
> 
> So I'm not saying that the design is problematic, but IMHO it's just not
> mature enough to assume it will land in 9.1, considering it's still a large
> one, and the first non-rfc version just posted two days ago.


The RFC version was posted more than 2 months ago.

It has received some review comments from multiple people,
all of which were addressed in this patch set version.

I have not received any further comments during these 2 months, so I thought
the overall design is considered okay - if anything, there might be minor
code comments/issues but these can easily be improved/fixed in the 5 weeks
remaining to the soft code freeze for 9.1.


If anything, I think that the VM live phase (non-downtime) transfers
functionality should be deferred until 9.2 because:
* It wasn't a part of the RFC so even if implemented today would get much
less testing overall,

* It's orthogonal to the switchover time device state transfer functionality
introduced by this patch set and could be added on top of that without
changing the wire protocol for switchover time device state transfers,

* It doesn't impact the switchover downtime so in this case 9.1 would
already contain all what's necessary to improve it.


Thanks,
Maciej
Peter Xu June 21, 2024, 8:54 p.m. UTC | #6
On Fri, Jun 21, 2024 at 07:40:01PM +0200, Maciej S. Szmigiero wrote:
> On 21.06.2024 17:56, Peter Xu wrote:
> > On Fri, Jun 21, 2024 at 05:31:54PM +0200, Maciej S. Szmigiero wrote:
> > > On 21.06.2024 17:04, Fabiano Rosas wrote:
> > > > "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> > > > 
> > > > > Hi Fabiano,
> > > > > 
> > > > > On 20.06.2024 23:21, Fabiano Rosas wrote:
> > > > > > Hi folks,
> > > > > > 
> > > > > > First of all, apologies for the roughness of the series. I'm off for
> > > > > > the next couple of weeks and wanted to put something together early
> > > > > > for your consideration.
> > > > > > 
> > > > > > This series is a refactoring (based on an earlier, off-list
> > > > > > attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
> > > > > > the multifd core. If we're going to add support for more data types to
> > > > > > multifd, we first need to clean that up.
> > > > > > 
> > > > > > This time around this work was prompted by Maciej's series[1]. I see
> > > > > > you're having to add a bunch of is_device_state checks to work around
> > > > > > the rigidity of the code.
> > > > > > 
> > > > > > Aside from the VFIO work, there is also the intent (coming back from
> > > > > > Juan's ideas) to make multifd the default code path for migration,
> > > > > > which will have to include the vmstate migration and anything else we
> > > > > > put on the stream via QEMUFile.
> > > > > > 
> > > > > > I have long since been bothered by having 'pages' sprinkled all over
> > > > > > the code, so I might be coming at this with a bit of a narrow focus,
> > > > > > but I believe in order to support more types of payloads in multifd,
> > > > > > we need to first allow the scheduling at multifd_send_pages() to be
> > > > > > independent of MultiFDPages_t. So here it is. Let me know what you
> > > > > > think.
> > > > > 
> > > > > Thanks for the patch set, I quickly glanced at these patches and they
> > > > > definitely make sense to me.
> > > > > 
> > > (..)
> > > > > > (as I said, I'll be off for a couple of weeks, so feel free to
> > > > > > incorporate any of this code if it's useful. Or to ignore it
> > > > > > completely).
> > > > > 
> > > > > I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has
> > > > > feature freeze in about a month, correct?
> > > > > 
> > > > 
> > > > For general code improvements like this I'm not thinking about QEMU
> > > > releases at all. But this series is not super complex, so I could
> > > > imagine we merging it in time for 9.1 if we reach an agreement.
> > > > 
> > > > Are you thinking your series might miss the target? Or have concerns
> > > > over the stability of the refactoring? We can within reason merge code
> > > > based on the current framework and improve things on top, we already did
> > > > something similar when merging zero-page support. I don't have an issue
> > > > with that.
> > > 
> > > The reason that I asked whether you are targeting 9.1 is because my
> > > patch set is definitely targeting that release.
> > > 
> > > At the same time my patch set will need to be rebased/refactored on top
> > > of this patch set if it is supposed to be merged for 9.1 too.
> > > 
> > > If this patch set gets merged quickly that's not really a problem.
> > > 
> > > On the other hand, if another iteration(s) is/are needed AND you are
> > > not available in the coming weeks to work on them then there's a
> > > question whether we will make the required deadline.
> > 
> > I think it's a bit rush to merge the vfio series in this release.  I'm not
> > sure it has enough time to be properly reviewed, reposted, retested, etc.
> > 
> > I've already started looking at it, and so far I think I have doubt not
> > only on agreement with Fabiano on the device_state thing which I prefer to
> > avoid, but also I'm thinking of any possible way to at least make the
> > worker threads generic too: a direct impact could be vDPA in the near
> > future if anyone cared, while I don't want modules to create threads
> > randomly during migration.
> > 
> > Meanwhile I'm also thinking whether that "the thread needs to dump all
> > data, and during iteration we can't do that" is the good reason to not
> > support that during iterations.
> > 
> > I didn't yet reply because I don't think I think all things through, but
> > I'll get there.
> > 
> > So I'm not saying that the design is problematic, but IMHO it's just not
> > mature enough to assume it will land in 9.1, considering it's still a large
> > one, and the first non-rfc version just posted two days ago.
> 
> 
> The RFC version was posted more than 2 months ago.
> 
> It has received some review comments from multiple people,
> all of which were addressed in this patch set version.

I thought it was mostly me who reviewed it, am I right?  Or do you have
other thread that has such discussion happening, and the design review has
properly done and reached an agreement?

IMHO that is also not how RFC works.

It doesn't work like "if RFC didn't got NACKed, a maintainer should merge
v1 when someone posts it".  Instead RFC should only mean these at least to
me: "(1) please review this from high level, things can drastically change;
(2) please don't merge this, because it is not for merging but for getting
comments."

Beyond, it doesn't imply anything for what happens after the RFC series.

> 
> I have not received any further comments during these 2 months, so I thought
> the overall design is considered okay - if anything, there might be minor
> code comments/issues but these can easily be improved/fixed in the 5 weeks
> remaining to the soft code freeze for 9.1.

The latest email in that thread (assuming this one is what you're referring
to) is:

https://lore.kernel.org/qemu-devel/f67fcc88-aaf6-43f7-9287-90cbd7495ba1@nvidia.com/#t

I didn't hear anything from Avihai yet, neither did I think we reached an
complete agreement on the whole design.

> 
> 
> If anything, I think that the VM live phase (non-downtime) transfers
> functionality should be deferred until 9.2 because:
> * It wasn't a part of the RFC so even if implemented today would get much
> less testing overall,

IMO it really depends on what was proposed.  Anyone who send patches should
definitely test whatever the patchset provides.  If that patchset includes
the iteration changes then that needs to be tested by the submitter.

> 
> * It's orthogonal to the switchover time device state transfer functionality
> introduced by this patch set and could be added on top of that without
> changing the wire protocol for switchover time device state transfers,

AFAICT it will affect the wire protocol?  If the dest QEMU contains your
patcheset to be the old version of QEMU, then the threads will only be
created at the switchover phase, and it won't be ready to take whatever
data sent from a new QEMU which may allow migrating VFIO iteration data,
who may expect these VFIO data to be passed over via multifd channels
even before the switchover.

It can only be compatible at least when the threads are created before
iteration starts on dest side, and I didn't yet check the packet headers
and other stuffs.

I think that can be a sweet spot where maybe you can land this series
sooner, but it also gets ready for anyone who wants to further extend that
to iteration phase easily.  Not sure whether it'll be easily feasible by
just moving the thread creations earlier.

> 
> * It doesn't impact the switchover downtime so in this case 9.1 would
> already contain all what's necessary to improve it.

Yes it won't, but IMHO that's not an issue.

Since Fabiano is going on a short break soon, I think I'll be the only one
review it.  I'll try my best, but still I can't guarantee it will be able
to land in 9.1, and this is not the only thing I'll need to do next week..

I appreciated a lot you worked out VFIO on top of multifd, because IMHO
that's really the right direction.  However even with that, I don't think
the whole design is yet fully settled, not to mention the details. And that
implies it may miss 9.1.

Thanks,
Maciej S. Szmigiero June 23, 2024, 8:25 p.m. UTC | #7
On 21.06.2024 22:54, Peter Xu wrote:
> On Fri, Jun 21, 2024 at 07:40:01PM +0200, Maciej S. Szmigiero wrote:
>> On 21.06.2024 17:56, Peter Xu wrote:
>>> On Fri, Jun 21, 2024 at 05:31:54PM +0200, Maciej S. Szmigiero wrote:
>>>> On 21.06.2024 17:04, Fabiano Rosas wrote:
>>>>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>>>>
>>>>>> Hi Fabiano,
>>>>>>
>>>>>> On 20.06.2024 23:21, Fabiano Rosas wrote:
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> First of all, apologies for the roughness of the series. I'm off for
>>>>>>> the next couple of weeks and wanted to put something together early
>>>>>>> for your consideration.
>>>>>>>
>>>>>>> This series is a refactoring (based on an earlier, off-list
>>>>>>> attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
>>>>>>> the multifd core. If we're going to add support for more data types to
>>>>>>> multifd, we first need to clean that up.
>>>>>>>
>>>>>>> This time around this work was prompted by Maciej's series[1]. I see
>>>>>>> you're having to add a bunch of is_device_state checks to work around
>>>>>>> the rigidity of the code.
>>>>>>>
>>>>>>> Aside from the VFIO work, there is also the intent (coming back from
>>>>>>> Juan's ideas) to make multifd the default code path for migration,
>>>>>>> which will have to include the vmstate migration and anything else we
>>>>>>> put on the stream via QEMUFile.
>>>>>>>
>>>>>>> I have long since been bothered by having 'pages' sprinkled all over
>>>>>>> the code, so I might be coming at this with a bit of a narrow focus,
>>>>>>> but I believe in order to support more types of payloads in multifd,
>>>>>>> we need to first allow the scheduling at multifd_send_pages() to be
>>>>>>> independent of MultiFDPages_t. So here it is. Let me know what you
>>>>>>> think.
>>>>>>
>>>>>> Thanks for the patch set, I quickly glanced at these patches and they
>>>>>> definitely make sense to me.
>>>>>>
>>>> (..)
>>>>>>> (as I said, I'll be off for a couple of weeks, so feel free to
>>>>>>> incorporate any of this code if it's useful. Or to ignore it
>>>>>>> completely).
>>>>>>
>>>>>> I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has
>>>>>> feature freeze in about a month, correct?
>>>>>>
>>>>>
>>>>> For general code improvements like this I'm not thinking about QEMU
>>>>> releases at all. But this series is not super complex, so I could
>>>>> imagine we merging it in time for 9.1 if we reach an agreement.
>>>>>
>>>>> Are you thinking your series might miss the target? Or have concerns
>>>>> over the stability of the refactoring? We can within reason merge code
>>>>> based on the current framework and improve things on top, we already did
>>>>> something similar when merging zero-page support. I don't have an issue
>>>>> with that.
>>>>
>>>> The reason that I asked whether you are targeting 9.1 is because my
>>>> patch set is definitely targeting that release.
>>>>
>>>> At the same time my patch set will need to be rebased/refactored on top
>>>> of this patch set if it is supposed to be merged for 9.1 too.
>>>>
>>>> If this patch set gets merged quickly that's not really a problem.
>>>>
>>>> On the other hand, if another iteration(s) is/are needed AND you are
>>>> not available in the coming weeks to work on them then there's a
>>>> question whether we will make the required deadline.
>>>
>>> I think it's a bit rush to merge the vfio series in this release.  I'm not
>>> sure it has enough time to be properly reviewed, reposted, retested, etc.
>>>
>>> I've already started looking at it, and so far I think I have doubt not
>>> only on agreement with Fabiano on the device_state thing which I prefer to
>>> avoid, but also I'm thinking of any possible way to at least make the
>>> worker threads generic too: a direct impact could be vDPA in the near
>>> future if anyone cared, while I don't want modules to create threads
>>> randomly during migration.
>>>
>>> Meanwhile I'm also thinking whether that "the thread needs to dump all
>>> data, and during iteration we can't do that" is the good reason to not
>>> support that during iterations.
>>>
>>> I didn't yet reply because I don't think I think all things through, but
>>> I'll get there.
>>>
>>> So I'm not saying that the design is problematic, but IMHO it's just not
>>> mature enough to assume it will land in 9.1, considering it's still a large
>>> one, and the first non-rfc version just posted two days ago.
>>
>>
>> The RFC version was posted more than 2 months ago.
>>
>> It has received some review comments from multiple people,
>> all of which were addressed in this patch set version.
> 
> I thought it was mostly me who reviewed it, am I right?  Or do you have
> other thread that has such discussion happening, and the design review has
> properly done and reached an agreement?

Daniel P. Berrangé also submitted a few comments: [1], [2], [3], [4], [5].
In fact, it is him who first suggested not having a new channel header
wire format or dedicated device state channels.

In addition to that, Avihai was also following our discussions: [6] and
he also looked privately at an early (but functioning) draft of these
patches well before the RFC was even publicly posted.

> IMHO that is also not how RFC works.
> 
> It doesn't work like "if RFC didn't got NACKed, a maintainer should merge
> v1 when someone posts it".  Instead RFC should only mean these at least to
> me: "(1) please review this from high level, things can drastically change;
> (2) please don't merge this, because it is not for merging but for getting
> comments."
> 
> Beyond, it doesn't imply anything for what happens after the RFC series.

That "RFC" marking on v0 was meant to signify it as a draft not suitable
to be merged immediately.
Much like marking a {pull,merge} request a draft on Git{Hub,Lab}.

docs/devel/submitting-a-patch.rst even says that:
> "RFC" means "Request For Comments" and is a statement that you don't
> intend for your patchset to be applied to master, but would like some
> review on it anyway.

>>
>> I have not received any further comments during these 2 months, so I thought
>> the overall design is considered okay - if anything, there might be minor
>> code comments/issues but these can easily be improved/fixed in the 5 weeks
>> remaining to the soft code freeze for 9.1.
> 
> The latest email in that thread (assuming this one is what you're referring
> to) is:
> 
> https://lore.kernel.org/qemu-devel/f67fcc88-aaf6-43f7-9287-90cbd7495ba1@nvidia.com/#t
> 
> I didn't hear anything from Avihai yet, neither did I think we reached an
> complete agreement on the whole design.

So what then is necessary to reach a "complete agreement on the whole design"?

Do you think organizing a brainstorming session somewhere (Zoom, etc.) would
help with that?

Although there is always a risk that such "10,000 foot" design turns out
to have significantly worse performance than this one - how a (sensible)
design will perform in real world testing is rather hard to predict in
advance.

The current design took a while to make sure we don't leave any possible
performance (downtime improvement) by mistake.

>>
>>
>> If anything, I think that the VM live phase (non-downtime) transfers
>> functionality should be deferred until 9.2 because:
>> * It wasn't a part of the RFC so even if implemented today would get much
>> less testing overall,
> 
> IMO it really depends on what was proposed.  Anyone who send patches should
> definitely test whatever the patchset provides.  If that patchset includes
> the iteration changes then that needs to be tested by the submitter.

I agree that anything proposed needs to be well tested before being
submitted.

>>
>> * It's orthogonal to the switchover time device state transfer functionality
>> introduced by this patch set and could be added on top of that without
>> changing the wire protocol for switchover time device state transfers,
> 
> AFAICT it will affect the wire protocol?  If the dest QEMU contains your
> patcheset to be the old version of QEMU, then the threads will only be
> created at the switchover phase, and it won't be ready to take whatever
> data sent from a new QEMU which may allow migrating VFIO iteration data,
> who may expect these VFIO data to be passed over via multifd channels
> even before the switchover.
> 
> It can only be compatible at least when the threads are created before
> iteration starts on dest side, and I didn't yet check the packet headers
> and other stuffs.
> 
> I think that can be a sweet spot where maybe you can land this series
> sooner, but it also gets ready for anyone who wants to further extend that
> to iteration phase easily.  Not sure whether it'll be easily feasible by
> just moving the thread creations earlier.


If someone is migrating data to an older QEMU version that does not
contain this patch set the source must have "x-migration-multifd-transfer"
VFIO device property set to false (the default value) for wire format
compatibility.

The same will go for VM live phase data - it will need to have some
additional setting which needs to be disabled for the wire format
to be compatible with older QEMU versions that do not understand the
such data transfer over multifd channels.

On the other hand, as I wrote in the cover letter, there's nothing
stopping a QEMU device driver from requiring different handling
(loading, etc.) of VM live phase data from the post-switchover data.

So loading such VM live phase data will need a new handler anyway.

> 
>>
>> * It doesn't impact the switchover downtime so in this case 9.1 would
>> already contain all what's necessary to improve it.
> 
> Yes it won't, but IMHO that's not an issue.
> 
> Since Fabiano is going on a short break soon, I think I'll be the only one
> review it.  I'll try my best, but still I can't guarantee it will be able
> to land in 9.1, and this is not the only thing I'll need to do next week..
> 
> I appreciated a lot you worked out VFIO on top of multifd, because IMHO
> that's really the right direction.  However even with that, I don't think
> the whole design is yet fully settled, not to mention the details. And that
> implies it may miss 9.1.

I appreciate your work and review Peter - it would be nice if we could
at least make some progress on this subject for 9.1.

> Thanks,
> 

Thanks,
Maciej

[1]: https://lore.kernel.org/qemu-devel/Zh-KF72Fe9oV6tfT@redhat.com/
[2]: https://lore.kernel.org/qemu-devel/Zh_6W8u3H4FmGS49@redhat.com/
[3]: https://lore.kernel.org/qemu-devel/ZiD4aLSre6qubuHr@redhat.com/
[4]: https://lore.kernel.org/qemu-devel/ZiJCSZvsekaO8dzO@redhat.com/
[5]: https://lore.kernel.org/qemu-devel/ZiJFU_QrOHVvwe4W@redhat.com/
[6]: https://lore.kernel.org/qemu-devel/7e855ccb-d5af-490f-94ab-61141fa30ba8@nvidia.com/
Peter Xu June 23, 2024, 8:45 p.m. UTC | #8
On Sun, Jun 23, 2024 at 10:25:05PM +0200, Maciej S. Szmigiero wrote:
> > I appreciated a lot you worked out VFIO on top of multifd, because IMHO
> > that's really the right direction.  However even with that, I don't think
> > the whole design is yet fully settled, not to mention the details. And that
> > implies it may miss 9.1.
> 
> I appreciate your work and review Peter - it would be nice if we could
> at least make some progress on this subject for 9.1.

Let's try and see, I didn't mean it won't hit 9.1 at all, it just feels
still challenging, but who knows!  I just don't want to give you that
feeling then right before softfreeze I start to say "it's not ready".

I think one reason it'll be more challenge is also since Fabiano will be
missing for weeks, and since you look like to agree on his RFC as a start,
it means it might be good idea we wait for his back and see how that goes
from there even if we can reach some consensus; it'll just be challenging
already.

I also left my (slightly lengthy) comment on the high level design of that
series here (I know you'll see that, but just to keep a record of this
discussion and link them together):

https://lore.kernel.org/r/ZniFH14DT6ycjbrL@x1n

Let's discuss there, let me know if something I just made it wrong, and
also if you're targeting landing part of the series you can try to
prioritize / provision what can already be landed and easier.

I actually have a wild idea where I don't know whether "switchover phase"
hooks are even needed.  I mean, currently save_live_iterate() allows
"returning 1 means all data ready".  Logically in switchover phase the
migration core could simply call save_live_iterate() until it returns 1.
Then switchover hooks do not need to exist at all.  I didn't check the
details, but if that works that'll simplify all register_savevm_live()
users, and also for VFIO it may mean iteration phase is more important too,
as when it's resolved it naturally works with switchover phase.  You can
ignore this idea because it can be too wild and definitely not helpful on
landing things fast, just in case it may bring some thoughts.

Thanks,