Message ID | 20230501140141.11743-1-avihaih@nvidia.com (mailing list archive)
---|---
Series | migration: Add precopy initial data capability and VFIO precopy support
On Mon, May 01, 2023 at 05:01:33PM +0300, Avihai Horon wrote: > Hello everyone, Hi, Avihai, > === Flow of operation === > > To use precopy initial data, the capability must be enabled in the > source. > > As this capability must be supported also in the destination, a > handshake is performed during migration setup. The purpose of the > handshake is to notify the destination that precopy initial data is used > and to check if it's supported. > > The handshake is done in two levels. First, a general handshake is done > with the destination migration code to notify that precopy initial data > is used. Then, for each migration user in the source that supports > precopy initial data, a handshake is done with its counterpart in the > destination: > If both support it, precopy initial data will be used for them. > If source doesn't support it, precopy initial data will not be used for > them. > If source supports it and destination doesn't, migration will be failed. > > Assuming the handshake succeeded, migration starts to send precopy data > and as part of it also the initial precopy data. Initial precopy data is > just like any other precopy data and as such, migration code is not > aware of it. Therefore, it's the responsibility of the migration users > (such as VFIO devices) to notify their counterparts in the destination > that their initial precopy data has been sent (for example, VFIO > migration does it when its initial bytes reach zero). > > In the destination, migration code will query each migration user that > supports precopy initial data and check if its initial data has been > loaded. If initial data has been loaded by all of them, an ACK will be > sent to the source which will now be able to complete migration when > appropriate. I can understand why this is useful, what I'm not 100% sure is whether the complexity is needed. The idea seems to be that src never switchover unless it receives a READY notification from dst. I'm imaging below simplified and more general workflow, not sure whether it could work for you: - Introduce a new cap "switchover-ready", it means whether there'll be a ready event sent from dst -> src for "being ready for switchover" - When cap set, a new msg MIG_RP_MSG_SWITCHOVER_READY is defined and handled on src showing that dest is ready for switchover. It'll be sent only if dest is ready for the switchover - Introduce a field SaveVMHandlers.explicit_switchover_needed. For each special device like vfio that would like to participate in the decision making, device can set its explicit_switchover_needed=1. This field is ignored if the new cap is not set. - Dst qemu: when new cap set, remember how many special devices are there requesting explicit switchover (count of SaveVMHandlers that has the bit set during load setup) as switch_over_pending=N. - Dst qemu: Once a device thinks its fine to switchover (probably in the load_state() callback), it calls migration_notify_switchover_ready(). That decreases switch_over_pending and when it hits zero, one msg MIG_RP_MSG_SWITCHOVER_READY will be sent to src. Only until READY msg received on src could src switchover the precopy to dst. Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1 more msg (dst->src). This is based on the fact that right now we always set caps on both qemus so I suppose it already means either both have or don't have the feature (even if one has, not setting the cap means disabled on both). Would it work for this case and cleaner? Thanks,
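Peter's proposal above boils down to one destination-side counter plus one return-path message. A minimal C sketch of that flow, assuming the names from the proposal (explicit_switchover_needed, switch_over_pending, migration_notify_switchover_ready(), MIG_RP_MSG_SWITCHOVER_READY) and hypothetical glue around them; this is not actual QEMU code:

```c
#include <stdbool.h>

/* Proposed new field on SaveVMHandlers (sketch, not the real struct). */
typedef struct SaveVMHandlers {
    bool explicit_switchover_needed;
    /* ... existing save/load callbacks ... */
} SaveVMHandlers;

static int switch_over_pending;      /* counted once during load setup */

static void migrate_send_rp_switchover_ready(void)
{
    /* Stand-in for queueing MIG_RP_MSG_SWITCHOVER_READY on the return path. */
}

/* Destination, load setup: count the devices that opted in. */
static void count_explicit_switchover(const SaveVMHandlers *ops, bool cap_set)
{
    if (cap_set && ops->explicit_switchover_needed) {
        switch_over_pending++;
    }
}

/* Called by a device (e.g. from its load_state() callback) once it is
 * ready for the switchover. */
void migration_notify_switchover_ready(void)
{
    if (--switch_over_pending == 0) {
        /* Last pending device: tell the source it may switch over now. */
        migrate_send_rp_switchover_ready();
    }
}
```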
On 03/05/2023 1:49, Peter Xu wrote: > External email: Use caution opening links or attachments > > > On Mon, May 01, 2023 at 05:01:33PM +0300, Avihai Horon wrote: >> Hello everyone, > Hi, Avihai, > >> === Flow of operation === >> >> To use precopy initial data, the capability must be enabled in the >> source. >> >> As this capability must be supported also in the destination, a >> handshake is performed during migration setup. The purpose of the >> handshake is to notify the destination that precopy initial data is used >> and to check if it's supported. >> >> The handshake is done in two levels. First, a general handshake is done >> with the destination migration code to notify that precopy initial data >> is used. Then, for each migration user in the source that supports >> precopy initial data, a handshake is done with its counterpart in the >> destination: >> If both support it, precopy initial data will be used for them. >> If source doesn't support it, precopy initial data will not be used for >> them. >> If source supports it and destination doesn't, migration will be failed. >> >> Assuming the handshake succeeded, migration starts to send precopy data >> and as part of it also the initial precopy data. Initial precopy data is >> just like any other precopy data and as such, migration code is not >> aware of it. Therefore, it's the responsibility of the migration users >> (such as VFIO devices) to notify their counterparts in the destination >> that their initial precopy data has been sent (for example, VFIO >> migration does it when its initial bytes reach zero). >> >> In the destination, migration code will query each migration user that >> supports precopy initial data and check if its initial data has been >> loaded. If initial data has been loaded by all of them, an ACK will be >> sent to the source which will now be able to complete migration when >> appropriate. > I can understand why this is useful, what I'm not 100% sure is whether the > complexity is needed. The idea seems to be that src never switchover > unless it receives a READY notification from dst. > > I'm imaging below simplified and more general workflow, not sure whether it > could work for you: > > - Introduce a new cap "switchover-ready", it means whether there'll be a > ready event sent from dst -> src for "being ready for switchover" > > - When cap set, a new msg MIG_RP_MSG_SWITCHOVER_READY is defined and > handled on src showing that dest is ready for switchover. It'll be sent > only if dest is ready for the switchover > > - Introduce a field SaveVMHandlers.explicit_switchover_needed. For each > special device like vfio that would like to participate in the decision > making, device can set its explicit_switchover_needed=1. This field is > ignored if the new cap is not set. > > - Dst qemu: when new cap set, remember how many special devices are there > requesting explicit switchover (count of SaveVMHandlers that has the > bit set during load setup) as switch_over_pending=N. > > - Dst qemu: Once a device thinks its fine to switchover (probably in the > load_state() callback), it calls migration_notify_switchover_ready(). > That decreases switch_over_pending and when it hits zero, one msg > MIG_RP_MSG_SWITCHOVER_READY will be sent to src. > > Only until READY msg received on src could src switchover the precopy to > dst. > > Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1 > more msg (dst->src). 
> > This is based on the fact that right now we always set caps on both qemus > so I suppose it already means either both have or don't have the feature > (even if one has, not setting the cap means disabled on both). > > Would it work for this case and cleaner? Hi Peter, thanks for the response! Your approach is indeed much simpler, however I have a few concerns regarding compatibility. You are saying that caps are always set both in src and dest. But what happens if we set the cap only on one side? Should we care about these scenarios? For example, if we set the cap only in src, then src will wait indefinitely for dest to notify that switchover is ready. Would you expect migration to fail instead of just keep running indefinitely? In current approach we only need to enable the cap in the source, so such scenario can't happen. Let's look at some other scenario. Src QEMU supports explicit-switchover for device X but *not* for device Y (i.e., src QEMU is some older version of QEMU that supports explicit-switchover for device X but not for Y). Dest QEMU supports explicit-switchover for device X and device Y. The capability is set in both src and dest. In the destination we will have switchover_pending=2 because both X and Y support explicit-switchover. We do migration, but switchover_pending will never reach 0 because only X supports it in the source, so the migration will run indefinitely. The per-device handshake solves this by making device Y not use explicit-switchover in this case. Thanks.
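To make the second scenario concrete, below is a tiny self-contained simulation of the stuck counter under the proposed scheme: the destination counts both devices during load setup, but a source that only implements the protocol for device X never triggers device Y's notification, so the counter never drains. The array and device labels are made up purely for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
    /* Destination: both X and Y declare the capability, so it counts two. */
    int switchover_pending = 2;

    /*
     * Only device X's source side implements the protocol, so only X ever
     * sends its "initial data sent" marker and triggers the destination's
     * notification; Y (old source QEMU) never does.
     */
    bool ready_notified[2] = { true /* X */, false /* Y */ };

    for (int i = 0; i < 2; i++) {
        if (ready_notified[i]) {
            switchover_pending--;
        }
    }

    /* Stuck at 1: the READY message is never sent, the source never switches over. */
    printf("switchover_pending = %d\n", switchover_pending);
    return 0;
}
```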
On Wed, May 03, 2023 at 06:22:59PM +0300, Avihai Horon wrote: > > On 03/05/2023 1:49, Peter Xu wrote: > > External email: Use caution opening links or attachments > > > > > > On Mon, May 01, 2023 at 05:01:33PM +0300, Avihai Horon wrote: > > > Hello everyone, > > Hi, Avihai, > > > > > === Flow of operation === > > > > > > To use precopy initial data, the capability must be enabled in the > > > source. > > > > > > As this capability must be supported also in the destination, a > > > handshake is performed during migration setup. The purpose of the > > > handshake is to notify the destination that precopy initial data is used > > > and to check if it's supported. > > > > > > The handshake is done in two levels. First, a general handshake is done > > > with the destination migration code to notify that precopy initial data > > > is used. Then, for each migration user in the source that supports > > > precopy initial data, a handshake is done with its counterpart in the > > > destination: > > > If both support it, precopy initial data will be used for them. > > > If source doesn't support it, precopy initial data will not be used for > > > them. > > > If source supports it and destination doesn't, migration will be failed. > > > > > > Assuming the handshake succeeded, migration starts to send precopy data > > > and as part of it also the initial precopy data. Initial precopy data is > > > just like any other precopy data and as such, migration code is not > > > aware of it. Therefore, it's the responsibility of the migration users > > > (such as VFIO devices) to notify their counterparts in the destination > > > that their initial precopy data has been sent (for example, VFIO > > > migration does it when its initial bytes reach zero). > > > > > > In the destination, migration code will query each migration user that > > > supports precopy initial data and check if its initial data has been > > > loaded. If initial data has been loaded by all of them, an ACK will be > > > sent to the source which will now be able to complete migration when > > > appropriate. > > I can understand why this is useful, what I'm not 100% sure is whether the > > complexity is needed. The idea seems to be that src never switchover > > unless it receives a READY notification from dst. > > > > I'm imaging below simplified and more general workflow, not sure whether it > > could work for you: > > > > - Introduce a new cap "switchover-ready", it means whether there'll be a > > ready event sent from dst -> src for "being ready for switchover" > > > > - When cap set, a new msg MIG_RP_MSG_SWITCHOVER_READY is defined and > > handled on src showing that dest is ready for switchover. It'll be sent > > only if dest is ready for the switchover > > > > - Introduce a field SaveVMHandlers.explicit_switchover_needed. For each > > special device like vfio that would like to participate in the decision > > making, device can set its explicit_switchover_needed=1. This field is > > ignored if the new cap is not set. > > > > - Dst qemu: when new cap set, remember how many special devices are there > > requesting explicit switchover (count of SaveVMHandlers that has the > > bit set during load setup) as switch_over_pending=N. > > > > - Dst qemu: Once a device thinks its fine to switchover (probably in the > > load_state() callback), it calls migration_notify_switchover_ready(). > > That decreases switch_over_pending and when it hits zero, one msg > > MIG_RP_MSG_SWITCHOVER_READY will be sent to src. 
> > > > Only until READY msg received on src could src switchover the precopy to > > dst. > > > > Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1 > > more msg (dst->src). > > > > This is based on the fact that right now we always set caps on both qemus > > so I suppose it already means either both have or don't have the feature > > (even if one has, not setting the cap means disabled on both). > > > > Would it work for this case and cleaner? > > Hi Peter, thanks for the response! > Your approach is indeed much simpler, however I have a few concerns > regarding compatibility. > > You are saying that caps are always set both in src and dest. > But what happens if we set the cap only on one side? > Should we care about these scenarios? I think it's not needed for now, but I am aware that this is a problem. It's just that it is a more generic problem to me rather than very special in the current feature being proposed. At least there're a few times Daniel showed concern on keeping this way and hoped we can have a better handshake in general with migration framework. I'd be perfectly fine if you want to approach this with a handshake methodology, but I hope if so we should provide a more generic handshake. So potentially that can make this new feature rely on the handshake work, and slower to get into shape. Your call on how to address this, at least fine by me either way. In my imagination a generic handshake should happen at the very start of migration and negociate feature bits between src/dst qemu, so they can reach a consensus on what to do next. > For example, if we set the cap only in src, then src will wait indefinitely > for dest to notify that switchover is ready. > Would you expect migration to fail instead of just keep running > indefinitely? > In current approach we only need to enable the cap in the source, so such > scenario can't happen. > > Let's look at some other scenario. > Src QEMU supports explicit-switchover for device X but *not* for device Y > (i.e., src QEMU is some older version of QEMU that supports > explicit-switchover for device X but not for Y). > Dest QEMU supports explicit-switchover for device X and device Y. > The capability is set in both src and dest. > In the destination we will have switchover_pending=2 because both X and Y > support explicit-switchover. > We do migration, but switchover_pending will never reach 0 because only X > supports it in the source, so the migration will run indefinitely. > The per-device handshake solves this by making device Y not use > explicit-switchover in this case. Hmm, right. When I was replying obviously I thought that decision can be made sololy by the dest qemu, then I assumed it's fine. Because IIUC in that case how many devices that supports switchover_pending on src qemu doesn't really matter but only dest. But I re-read the last patch and I do see that there's a new bit that will change the device protocol of migration: if (migration->initial_data_active && !migration->precopy_init_size && !migration->initial_data_sent) { qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT); migration->initial_data_sent = true; } else { qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); } With this, I think what you said makes sense because then the src qemu matters on deciding whether to send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT, it also needs to make sure dst qemu will recognize it. Do you think this new VFIO_MIG_FLAG_DEV_INIT_DATA_SENT is a must to have? Can this decision be made on dest qemu only? 
To ask in another way: I saw that precopy_init_size is the foundation for deciding whether to send this flag. So it becomes a matter of whether dest qemu is also aware of precopy_init_size; if it were, it could already tell on its own when it's ready to handle the switchover. Thanks,
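For reference, a sketch of what the destination-side counterpart of the quoted source snippet might look like, which is also why the source-sent marker matters: the destination only learns that the initial data is complete when the flag shows up in the stream. VFIO_MIG_FLAG_DEV_INIT_DATA_SENT and VFIO_MIG_FLAG_END_OF_STATE are names from the quoted patch; the flag values, the switch layout and the notification helper are placeholders, not the real implementation.

```c
#include <stdint.h>

/* Placeholder values; the real constants live in QEMU's VFIO migration code. */
#define VFIO_MIG_FLAG_END_OF_STATE        0xffffffffef100001ULL
#define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT  0xffffffffef100005ULL

/* Hypothetical helper: tell the migration core that this device's initial
 * data has been loaded on the destination. */
static void initial_data_loaded_notify(void *opaque)
{
    (void)opaque;
}

static int vfio_handle_mig_flag(uint64_t flag, void *opaque)
{
    switch (flag) {
    case VFIO_MIG_FLAG_DEV_INIT_DATA_SENT:
        /*
         * The source has finished sending this device's initial precopy
         * data; everything before this marker has already been consumed,
         * so report readiness upward.
         */
        initial_data_loaded_notify(opaque);
        return 0;
    case VFIO_MIG_FLAG_END_OF_STATE:
        return 0;
    default:
        return -1;    /* unknown marker */
    }
}
```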
On 03/05/2023 18:49, Peter Xu wrote: > External email: Use caution opening links or attachments > > > On Wed, May 03, 2023 at 06:22:59PM +0300, Avihai Horon wrote: >> On 03/05/2023 1:49, Peter Xu wrote: >>> External email: Use caution opening links or attachments >>> >>> >>> On Mon, May 01, 2023 at 05:01:33PM +0300, Avihai Horon wrote: >>>> Hello everyone, >>> Hi, Avihai, >>> >>>> === Flow of operation === >>>> >>>> To use precopy initial data, the capability must be enabled in the >>>> source. >>>> >>>> As this capability must be supported also in the destination, a >>>> handshake is performed during migration setup. The purpose of the >>>> handshake is to notify the destination that precopy initial data is used >>>> and to check if it's supported. >>>> >>>> The handshake is done in two levels. First, a general handshake is done >>>> with the destination migration code to notify that precopy initial data >>>> is used. Then, for each migration user in the source that supports >>>> precopy initial data, a handshake is done with its counterpart in the >>>> destination: >>>> If both support it, precopy initial data will be used for them. >>>> If source doesn't support it, precopy initial data will not be used for >>>> them. >>>> If source supports it and destination doesn't, migration will be failed. >>>> >>>> Assuming the handshake succeeded, migration starts to send precopy data >>>> and as part of it also the initial precopy data. Initial precopy data is >>>> just like any other precopy data and as such, migration code is not >>>> aware of it. Therefore, it's the responsibility of the migration users >>>> (such as VFIO devices) to notify their counterparts in the destination >>>> that their initial precopy data has been sent (for example, VFIO >>>> migration does it when its initial bytes reach zero). >>>> >>>> In the destination, migration code will query each migration user that >>>> supports precopy initial data and check if its initial data has been >>>> loaded. If initial data has been loaded by all of them, an ACK will be >>>> sent to the source which will now be able to complete migration when >>>> appropriate. >>> I can understand why this is useful, what I'm not 100% sure is whether the >>> complexity is needed. The idea seems to be that src never switchover >>> unless it receives a READY notification from dst. >>> >>> I'm imaging below simplified and more general workflow, not sure whether it >>> could work for you: >>> >>> - Introduce a new cap "switchover-ready", it means whether there'll be a >>> ready event sent from dst -> src for "being ready for switchover" >>> >>> - When cap set, a new msg MIG_RP_MSG_SWITCHOVER_READY is defined and >>> handled on src showing that dest is ready for switchover. It'll be sent >>> only if dest is ready for the switchover >>> >>> - Introduce a field SaveVMHandlers.explicit_switchover_needed. For each >>> special device like vfio that would like to participate in the decision >>> making, device can set its explicit_switchover_needed=1. This field is >>> ignored if the new cap is not set. >>> >>> - Dst qemu: when new cap set, remember how many special devices are there >>> requesting explicit switchover (count of SaveVMHandlers that has the >>> bit set during load setup) as switch_over_pending=N. >>> >>> - Dst qemu: Once a device thinks its fine to switchover (probably in the >>> load_state() callback), it calls migration_notify_switchover_ready(). 
>>> That decreases switch_over_pending and when it hits zero, one msg >>> MIG_RP_MSG_SWITCHOVER_READY will be sent to src. >>> >>> Only until READY msg received on src could src switchover the precopy to >>> dst. >>> >>> Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1 >>> more msg (dst->src). >>> >>> This is based on the fact that right now we always set caps on both qemus >>> so I suppose it already means either both have or don't have the feature >>> (even if one has, not setting the cap means disabled on both). >>> >>> Would it work for this case and cleaner? >> Hi Peter, thanks for the response! >> Your approach is indeed much simpler, however I have a few concerns >> regarding compatibility. >> >> You are saying that caps are always set both in src and dest. >> But what happens if we set the cap only on one side? >> Should we care about these scenarios? > I think it's not needed for now, but I am aware that this is a problem. > It's just that it is a more generic problem to me rather than very special > in the current feature being proposed. At least there're a few times > Daniel showed concern on keeping this way and hoped we can have a better > handshake in general with migration framework. > > I'd be perfectly fine if you want to approach this with a handshake > methodology, but I hope if so we should provide a more generic handshake. > So potentially that can make this new feature rely on the handshake work, > and slower to get into shape. Your call on how to address this, at least > fine by me either way. I'd really like this feature to get in, and I'm afraid making it dependent on first implementing a general migration handshake may take a long time, like you said. What about keeping current approach but changing it such that the capability will have to be set in both src and dest, to make it similar to other capability usages? I.e., we will remove the "general" handshake: /* Enable precopy initial data generally in the migration */ memset(&buf, 0, sizeof(buf)); buf.general_enable = 1; qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf), (uint8_t *)&buf); but keep the per-device handshake, which is not a handshake for migration capabilities, but a part of the protocol when the capability is set, like in multifd, postcopy, etc. This way we can advance with this feature while making the general migration handshake an independent effort. Will that work for you? BTW, with your suggestion to add a notification mechanism to notify when initial data is loaded in dest, I think we can drop these two SaveVMHandlers handlers: /* * Checks if precopy initial data is active. If it's inactive, * initial_data_loaded check is skipped. */ bool (*is_initial_data_active)(void *opaque); /* Checks if precopy initial data has been loaded in dest */ bool (*initial_data_loaded)(void *opaque); > In my imagination a generic handshake should happen at the very start of > migration and negociate feature bits between src/dst qemu, so they can > reach a consensus on what to do next. > >> For example, if we set the cap only in src, then src will wait indefinitely >> for dest to notify that switchover is ready. >> Would you expect migration to fail instead of just keep running >> indefinitely? >> In current approach we only need to enable the cap in the source, so such >> scenario can't happen. >> >> Let's look at some other scenario. 
>> Src QEMU supports explicit-switchover for device X but *not* for device Y >> (i.e., src QEMU is some older version of QEMU that supports >> explicit-switchover for device X but not for Y). >> Dest QEMU supports explicit-switchover for device X and device Y. >> The capability is set in both src and dest. >> In the destination we will have switchover_pending=2 because both X and Y >> support explicit-switchover. >> We do migration, but switchover_pending will never reach 0 because only X >> supports it in the source, so the migration will run indefinitely. >> The per-device handshake solves this by making device Y not use >> explicit-switchover in this case. > Hmm, right. When I was replying obviously I thought that decision can be > made sololy by the dest qemu, then I assumed it's fine. Because IIUC in > that case how many devices that supports switchover_pending on src qemu > doesn't really matter but only dest. > > But I re-read the last patch and I do see that there's a new bit that will > change the device protocol of migration: > > if (migration->initial_data_active && !migration->precopy_init_size && > !migration->initial_data_sent) { > qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT); > migration->initial_data_sent = true; > } else { > qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); > } > > With this, I think what you said makes sense because then the src qemu > matters on deciding whether to send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT, it > also needs to make sure dst qemu will recognize it. > > Do you think this new VFIO_MIG_FLAG_DEV_INIT_DATA_SENT is a must to have? > Can this decision be made on dest qemu only? > > To ask in another way, I saw that precopy_init_size is the fundation to > decide whether to send this flag. Then it's a matter of whether dest qemu > is also aware of precopy_init_size, then it can already tell when it's > ready to handle the switchover. The destination is not aware of precopy_init_size, only the source knows it. So the source must send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT to notify dest that the initial data was sent. Thanks.
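The two callbacks being dropped effectively make the destination migration code poll every device, whereas the notification mechanism inverts that. A rough before/after sketch, with only the callback names taken from the series and everything else illustrative:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct SaveVMHandlers {
    /* The two callbacks Avihai suggests dropping. */
    bool (*is_initial_data_active)(void *opaque);
    bool (*initial_data_loaded)(void *opaque);
    void *opaque;
} SaveVMHandlers;

/* "Before": the destination polls every handler until all are loaded. */
static bool all_initial_data_loaded(SaveVMHandlers *h, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (h[i].is_initial_data_active &&
            h[i].is_initial_data_active(h[i].opaque) &&
            !h[i].initial_data_loaded(h[i].opaque)) {
            return false;
        }
    }
    return true;
}

/* "After": each device calls a notifier exactly once, no polling needed. */
static int initial_data_pending;

void migration_initial_data_loaded_notify(void)
{
    if (--initial_data_pending == 0) {
        /* All opted-in devices loaded their initial data: ACK the source. */
    }
}
```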
On Thu, May 04, 2023 at 01:18:04PM +0300, Avihai Horon wrote: > > On 03/05/2023 18:49, Peter Xu wrote: > > External email: Use caution opening links or attachments > > > > > > On Wed, May 03, 2023 at 06:22:59PM +0300, Avihai Horon wrote: > > > On 03/05/2023 1:49, Peter Xu wrote: > > > > External email: Use caution opening links or attachments > > > > > > > > > > > > On Mon, May 01, 2023 at 05:01:33PM +0300, Avihai Horon wrote: > > > > > Hello everyone, > > > > Hi, Avihai, > > > > > > > > > === Flow of operation === > > > > > > > > > > To use precopy initial data, the capability must be enabled in the > > > > > source. > > > > > > > > > > As this capability must be supported also in the destination, a > > > > > handshake is performed during migration setup. The purpose of the > > > > > handshake is to notify the destination that precopy initial data is used > > > > > and to check if it's supported. > > > > > > > > > > The handshake is done in two levels. First, a general handshake is done > > > > > with the destination migration code to notify that precopy initial data > > > > > is used. Then, for each migration user in the source that supports > > > > > precopy initial data, a handshake is done with its counterpart in the > > > > > destination: > > > > > If both support it, precopy initial data will be used for them. > > > > > If source doesn't support it, precopy initial data will not be used for > > > > > them. > > > > > If source supports it and destination doesn't, migration will be failed. > > > > > > > > > > Assuming the handshake succeeded, migration starts to send precopy data > > > > > and as part of it also the initial precopy data. Initial precopy data is > > > > > just like any other precopy data and as such, migration code is not > > > > > aware of it. Therefore, it's the responsibility of the migration users > > > > > (such as VFIO devices) to notify their counterparts in the destination > > > > > that their initial precopy data has been sent (for example, VFIO > > > > > migration does it when its initial bytes reach zero). > > > > > > > > > > In the destination, migration code will query each migration user that > > > > > supports precopy initial data and check if its initial data has been > > > > > loaded. If initial data has been loaded by all of them, an ACK will be > > > > > sent to the source which will now be able to complete migration when > > > > > appropriate. > > > > I can understand why this is useful, what I'm not 100% sure is whether the > > > > complexity is needed. The idea seems to be that src never switchover > > > > unless it receives a READY notification from dst. > > > > > > > > I'm imaging below simplified and more general workflow, not sure whether it > > > > could work for you: > > > > > > > > - Introduce a new cap "switchover-ready", it means whether there'll be a > > > > ready event sent from dst -> src for "being ready for switchover" > > > > > > > > - When cap set, a new msg MIG_RP_MSG_SWITCHOVER_READY is defined and > > > > handled on src showing that dest is ready for switchover. It'll be sent > > > > only if dest is ready for the switchover > > > > > > > > - Introduce a field SaveVMHandlers.explicit_switchover_needed. For each > > > > special device like vfio that would like to participate in the decision > > > > making, device can set its explicit_switchover_needed=1. This field is > > > > ignored if the new cap is not set. 
> > > > > > > > - Dst qemu: when new cap set, remember how many special devices are there > > > > requesting explicit switchover (count of SaveVMHandlers that has the > > > > bit set during load setup) as switch_over_pending=N. > > > > > > > > - Dst qemu: Once a device thinks its fine to switchover (probably in the > > > > load_state() callback), it calls migration_notify_switchover_ready(). > > > > That decreases switch_over_pending and when it hits zero, one msg > > > > MIG_RP_MSG_SWITCHOVER_READY will be sent to src. > > > > > > > > Only until READY msg received on src could src switchover the precopy to > > > > dst. > > > > > > > > Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1 > > > > more msg (dst->src). > > > > > > > > This is based on the fact that right now we always set caps on both qemus > > > > so I suppose it already means either both have or don't have the feature > > > > (even if one has, not setting the cap means disabled on both). > > > > > > > > Would it work for this case and cleaner? > > > Hi Peter, thanks for the response! > > > Your approach is indeed much simpler, however I have a few concerns > > > regarding compatibility. > > > > > > You are saying that caps are always set both in src and dest. > > > But what happens if we set the cap only on one side? > > > Should we care about these scenarios? > > I think it's not needed for now, but I am aware that this is a problem. > > It's just that it is a more generic problem to me rather than very special > > in the current feature being proposed. At least there're a few times > > Daniel showed concern on keeping this way and hoped we can have a better > > handshake in general with migration framework. > > > > I'd be perfectly fine if you want to approach this with a handshake > > methodology, but I hope if so we should provide a more generic handshake. > > So potentially that can make this new feature rely on the handshake work, > > and slower to get into shape. Your call on how to address this, at least > > fine by me either way. > > I'd really like this feature to get in, and I'm afraid making it dependent > on first implementing a general migration handshake may take a long time, > like you said. > What about keeping current approach but changing it such that the capability > will have to be set in both src and dest, to make it similar to other > capability usages? > I.e., we will remove the "general" handshake: > > /* Enable precopy initial data generally in the migration */ > memset(&buf, 0, sizeof(buf)); > buf.general_enable = 1; > qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf), > (uint8_t *)&buf); > > but keep the per-device handshake, which is not a handshake for migration > capabilities, but a part of the protocol when the capability is set, like in > multifd, postcopy, etc. > This way we can advance with this feature while making the general migration > handshake an independent effort. > Will that work for you? Yes it's fine by me. > > BTW, with your suggestion to add a notification mechanism to notify when > initial data is loaded in dest, I think we can drop these two SaveVMHandlers > handlers: > /* > * Checks if precopy initial data is active. If it's inactive, > * initial_data_loaded check is skipped. 
> */ > bool (*is_initial_data_active)(void *opaque); > /* Checks if precopy initial data has been loaded in dest */ > bool (*initial_data_loaded)(void *opaque); > > > In my imagination a generic handshake should happen at the very start of > > migration and negociate feature bits between src/dst qemu, so they can > > reach a consensus on what to do next. > > > > > For example, if we set the cap only in src, then src will wait indefinitely > > > for dest to notify that switchover is ready. > > > Would you expect migration to fail instead of just keep running > > > indefinitely? > > > In current approach we only need to enable the cap in the source, so such > > > scenario can't happen. > > > > > > Let's look at some other scenario. > > > Src QEMU supports explicit-switchover for device X but *not* for device Y > > > (i.e., src QEMU is some older version of QEMU that supports > > > explicit-switchover for device X but not for Y). > > > Dest QEMU supports explicit-switchover for device X and device Y. > > > The capability is set in both src and dest. > > > In the destination we will have switchover_pending=2 because both X and Y > > > support explicit-switchover. > > > We do migration, but switchover_pending will never reach 0 because only X > > > supports it in the source, so the migration will run indefinitely. > > > The per-device handshake solves this by making device Y not use > > > explicit-switchover in this case. > > Hmm, right. When I was replying obviously I thought that decision can be > > made sololy by the dest qemu, then I assumed it's fine. Because IIUC in > > that case how many devices that supports switchover_pending on src qemu > > doesn't really matter but only dest. > > > > But I re-read the last patch and I do see that there's a new bit that will > > change the device protocol of migration: > > > > if (migration->initial_data_active && !migration->precopy_init_size && > > !migration->initial_data_sent) { > > qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT); > > migration->initial_data_sent = true; > > } else { > > qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); > > } > > > > With this, I think what you said makes sense because then the src qemu > > matters on deciding whether to send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT, it > > also needs to make sure dst qemu will recognize it. > > > > Do you think this new VFIO_MIG_FLAG_DEV_INIT_DATA_SENT is a must to have? > > Can this decision be made on dest qemu only? > > > > To ask in another way, I saw that precopy_init_size is the fundation to > > decide whether to send this flag. Then it's a matter of whether dest qemu > > is also aware of precopy_init_size, then it can already tell when it's > > ready to handle the switchover. > > The destination is not aware of precopy_init_size, only the source knows it. > So the source must send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT to notify dest that > the initial data was sent. Then, can the src qemu notify instead? We can have similar notification mechanism on src qemu and if that can work we can further same the other MIG_RP_MSG. The counter counts how many special src devices are there and we don't switchover only if all agree. I know that even if !precopy_init_size on src, it doesn't mean that dest has already digested all the data in the send buffer. However since we'll anyway make sure queued data landed before switch over happens (e.g., when we only have 1 migration channel data are sent in sequential manner), it means when switchover the dst qemu should have these loaded? Thanks,
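A sketch of this source-only variant, with the counter living on the source and each opted-in device notifying once its initial data has gone out; all names here are hypothetical rather than taken from the series:

```c
#include <stdbool.h>

static int src_switchover_pending;    /* counted on the source during save setup */

/* Called by a source device, e.g. VFIO once its precopy_init_size reaches 0. */
void migration_src_notify_switchover_ready(void)
{
    src_switchover_pending--;
}

/* Checked by the migration thread before starting the switchover. */
bool migration_can_switchover(void)
{
    return src_switchover_pending == 0;
}
```

The implicit assumption, spelled out at the end of the mail, is that with a single in-order migration channel anything queued before the switchover point has already been received by the destination, and will have been loaded by the time the destination processes the switchover itself.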
On 04/05/2023 18:50, Peter Xu wrote: > External email: Use caution opening links or attachments > > > On Thu, May 04, 2023 at 01:18:04PM +0300, Avihai Horon wrote: >> On 03/05/2023 18:49, Peter Xu wrote: >>> External email: Use caution opening links or attachments >>> >>> >>> On Wed, May 03, 2023 at 06:22:59PM +0300, Avihai Horon wrote: >>>> On 03/05/2023 1:49, Peter Xu wrote: >>>>> External email: Use caution opening links or attachments >>>>> >>>>> >>>>> On Mon, May 01, 2023 at 05:01:33PM +0300, Avihai Horon wrote: >>>>>> Hello everyone, >>>>> Hi, Avihai, >>>>> >>>>>> === Flow of operation === >>>>>> >>>>>> To use precopy initial data, the capability must be enabled in the >>>>>> source. >>>>>> >>>>>> As this capability must be supported also in the destination, a >>>>>> handshake is performed during migration setup. The purpose of the >>>>>> handshake is to notify the destination that precopy initial data is used >>>>>> and to check if it's supported. >>>>>> >>>>>> The handshake is done in two levels. First, a general handshake is done >>>>>> with the destination migration code to notify that precopy initial data >>>>>> is used. Then, for each migration user in the source that supports >>>>>> precopy initial data, a handshake is done with its counterpart in the >>>>>> destination: >>>>>> If both support it, precopy initial data will be used for them. >>>>>> If source doesn't support it, precopy initial data will not be used for >>>>>> them. >>>>>> If source supports it and destination doesn't, migration will be failed. >>>>>> >>>>>> Assuming the handshake succeeded, migration starts to send precopy data >>>>>> and as part of it also the initial precopy data. Initial precopy data is >>>>>> just like any other precopy data and as such, migration code is not >>>>>> aware of it. Therefore, it's the responsibility of the migration users >>>>>> (such as VFIO devices) to notify their counterparts in the destination >>>>>> that their initial precopy data has been sent (for example, VFIO >>>>>> migration does it when its initial bytes reach zero). >>>>>> >>>>>> In the destination, migration code will query each migration user that >>>>>> supports precopy initial data and check if its initial data has been >>>>>> loaded. If initial data has been loaded by all of them, an ACK will be >>>>>> sent to the source which will now be able to complete migration when >>>>>> appropriate. >>>>> I can understand why this is useful, what I'm not 100% sure is whether the >>>>> complexity is needed. The idea seems to be that src never switchover >>>>> unless it receives a READY notification from dst. >>>>> >>>>> I'm imaging below simplified and more general workflow, not sure whether it >>>>> could work for you: >>>>> >>>>> - Introduce a new cap "switchover-ready", it means whether there'll be a >>>>> ready event sent from dst -> src for "being ready for switchover" >>>>> >>>>> - When cap set, a new msg MIG_RP_MSG_SWITCHOVER_READY is defined and >>>>> handled on src showing that dest is ready for switchover. It'll be sent >>>>> only if dest is ready for the switchover >>>>> >>>>> - Introduce a field SaveVMHandlers.explicit_switchover_needed. For each >>>>> special device like vfio that would like to participate in the decision >>>>> making, device can set its explicit_switchover_needed=1. This field is >>>>> ignored if the new cap is not set. 
>>>>> >>>>> - Dst qemu: when new cap set, remember how many special devices are there >>>>> requesting explicit switchover (count of SaveVMHandlers that has the >>>>> bit set during load setup) as switch_over_pending=N. >>>>> >>>>> - Dst qemu: Once a device thinks its fine to switchover (probably in the >>>>> load_state() callback), it calls migration_notify_switchover_ready(). >>>>> That decreases switch_over_pending and when it hits zero, one msg >>>>> MIG_RP_MSG_SWITCHOVER_READY will be sent to src. >>>>> >>>>> Only until READY msg received on src could src switchover the precopy to >>>>> dst. >>>>> >>>>> Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1 >>>>> more msg (dst->src). >>>>> >>>>> This is based on the fact that right now we always set caps on both qemus >>>>> so I suppose it already means either both have or don't have the feature >>>>> (even if one has, not setting the cap means disabled on both). >>>>> >>>>> Would it work for this case and cleaner? >>>> Hi Peter, thanks for the response! >>>> Your approach is indeed much simpler, however I have a few concerns >>>> regarding compatibility. >>>> >>>> You are saying that caps are always set both in src and dest. >>>> But what happens if we set the cap only on one side? >>>> Should we care about these scenarios? >>> I think it's not needed for now, but I am aware that this is a problem. >>> It's just that it is a more generic problem to me rather than very special >>> in the current feature being proposed. At least there're a few times >>> Daniel showed concern on keeping this way and hoped we can have a better >>> handshake in general with migration framework. >>> >>> I'd be perfectly fine if you want to approach this with a handshake >>> methodology, but I hope if so we should provide a more generic handshake. >>> So potentially that can make this new feature rely on the handshake work, >>> and slower to get into shape. Your call on how to address this, at least >>> fine by me either way. >> I'd really like this feature to get in, and I'm afraid making it dependent >> on first implementing a general migration handshake may take a long time, >> like you said. >> What about keeping current approach but changing it such that the capability >> will have to be set in both src and dest, to make it similar to other >> capability usages? >> I.e., we will remove the "general" handshake: >> >> /* Enable precopy initial data generally in the migration */ >> memset(&buf, 0, sizeof(buf)); >> buf.general_enable = 1; >> qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf), >> (uint8_t *)&buf); >> >> but keep the per-device handshake, which is not a handshake for migration >> capabilities, but a part of the protocol when the capability is set, like in >> multifd, postcopy, etc. >> This way we can advance with this feature while making the general migration >> handshake an independent effort. >> Will that work for you? > Yes it's fine by me. > >> BTW, with your suggestion to add a notification mechanism to notify when >> initial data is loaded in dest, I think we can drop these two SaveVMHandlers >> handlers: >> /* >> * Checks if precopy initial data is active. If it's inactive, >> * initial_data_loaded check is skipped. 
>> */ >> bool (*is_initial_data_active)(void *opaque); >> /* Checks if precopy initial data has been loaded in dest */ >> bool (*initial_data_loaded)(void *opaque); >> >>> In my imagination a generic handshake should happen at the very start of >>> migration and negociate feature bits between src/dst qemu, so they can >>> reach a consensus on what to do next. >>> >>>> For example, if we set the cap only in src, then src will wait indefinitely >>>> for dest to notify that switchover is ready. >>>> Would you expect migration to fail instead of just keep running >>>> indefinitely? >>>> In current approach we only need to enable the cap in the source, so such >>>> scenario can't happen. >>>> >>>> Let's look at some other scenario. >>>> Src QEMU supports explicit-switchover for device X but *not* for device Y >>>> (i.e., src QEMU is some older version of QEMU that supports >>>> explicit-switchover for device X but not for Y). >>>> Dest QEMU supports explicit-switchover for device X and device Y. >>>> The capability is set in both src and dest. >>>> In the destination we will have switchover_pending=2 because both X and Y >>>> support explicit-switchover. >>>> We do migration, but switchover_pending will never reach 0 because only X >>>> supports it in the source, so the migration will run indefinitely. >>>> The per-device handshake solves this by making device Y not use >>>> explicit-switchover in this case. >>> Hmm, right. When I was replying obviously I thought that decision can be >>> made sololy by the dest qemu, then I assumed it's fine. Because IIUC in >>> that case how many devices that supports switchover_pending on src qemu >>> doesn't really matter but only dest. >>> >>> But I re-read the last patch and I do see that there's a new bit that will >>> change the device protocol of migration: >>> >>> if (migration->initial_data_active && !migration->precopy_init_size && >>> !migration->initial_data_sent) { >>> qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT); >>> migration->initial_data_sent = true; >>> } else { >>> qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); >>> } >>> >>> With this, I think what you said makes sense because then the src qemu >>> matters on deciding whether to send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT, it >>> also needs to make sure dst qemu will recognize it. >>> >>> Do you think this new VFIO_MIG_FLAG_DEV_INIT_DATA_SENT is a must to have? >>> Can this decision be made on dest qemu only? >>> >>> To ask in another way, I saw that precopy_init_size is the fundation to >>> decide whether to send this flag. Then it's a matter of whether dest qemu >>> is also aware of precopy_init_size, then it can already tell when it's >>> ready to handle the switchover. >> The destination is not aware of precopy_init_size, only the source knows it. >> So the source must send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT to notify dest that >> the initial data was sent. > Then, can the src qemu notify instead? > > We can have similar notification mechanism on src qemu and if that can work > we can further same the other MIG_RP_MSG. The counter counts how many > special src devices are there and we don't switchover only if all agree. Do you mean the following: We will have the pending counter and notification mechanism in source instead of destination. The MIG_RP_MSG_INITIAL_DATA_LOADED_ACK message will be sent by each device in the destination that loaded its initial data. Each such RP_MSG will decrease the pending counter in source. 
When the pending counter in source reaches zero, source can complete migration. Or did I misunderstand you? > I know that even if !precopy_init_size on src, it doesn't mean that dest > has already digested all the data in the send buffer. However since we'll > anyway make sure queued data landed before switch over happens (e.g., when > we only have 1 migration channel data are sent in sequential manner), it > means when switchover the dst qemu should have these loaded? > > Thanks, > > -- > Peter Xu >
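Sketching the reading Avihai describes, with the counter on the source and one return-path ACK per device from the destination; MIG_RP_MSG_INITIAL_DATA_LOADED_ACK is the name used in the mail, the handler shape is illustrative only:

```c
static int initial_data_pending;    /* source side, one per opted-in device */

/* Source-side return-path handler, run once per per-device ACK
 * (MIG_RP_MSG_INITIAL_DATA_LOADED_ACK) coming from the destination. */
static void migrate_handle_rp_initial_data_loaded_ack(void)
{
    if (--initial_data_pending == 0) {
        /* Every opted-in device has loaded its initial data on the
         * destination, so the switchover may now be scheduled. */
    }
}
```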
On Sun, May 07, 2023 at 03:54:00PM +0300, Avihai Horon wrote: > > On 04/05/2023 18:50, Peter Xu wrote: > > External email: Use caution opening links or attachments > > > > > > On Thu, May 04, 2023 at 01:18:04PM +0300, Avihai Horon wrote: > > > On 03/05/2023 18:49, Peter Xu wrote: > > > > External email: Use caution opening links or attachments > > > > > > > > > > > > On Wed, May 03, 2023 at 06:22:59PM +0300, Avihai Horon wrote: > > > > > On 03/05/2023 1:49, Peter Xu wrote: > > > > > > External email: Use caution opening links or attachments > > > > > > > > > > > > > > > > > > On Mon, May 01, 2023 at 05:01:33PM +0300, Avihai Horon wrote: > > > > > > > Hello everyone, > > > > > > Hi, Avihai, > > > > > > > > > > > > > === Flow of operation === > > > > > > > > > > > > > > To use precopy initial data, the capability must be enabled in the > > > > > > > source. > > > > > > > > > > > > > > As this capability must be supported also in the destination, a > > > > > > > handshake is performed during migration setup. The purpose of the > > > > > > > handshake is to notify the destination that precopy initial data is used > > > > > > > and to check if it's supported. > > > > > > > > > > > > > > The handshake is done in two levels. First, a general handshake is done > > > > > > > with the destination migration code to notify that precopy initial data > > > > > > > is used. Then, for each migration user in the source that supports > > > > > > > precopy initial data, a handshake is done with its counterpart in the > > > > > > > destination: > > > > > > > If both support it, precopy initial data will be used for them. > > > > > > > If source doesn't support it, precopy initial data will not be used for > > > > > > > them. > > > > > > > If source supports it and destination doesn't, migration will be failed. > > > > > > > > > > > > > > Assuming the handshake succeeded, migration starts to send precopy data > > > > > > > and as part of it also the initial precopy data. Initial precopy data is > > > > > > > just like any other precopy data and as such, migration code is not > > > > > > > aware of it. Therefore, it's the responsibility of the migration users > > > > > > > (such as VFIO devices) to notify their counterparts in the destination > > > > > > > that their initial precopy data has been sent (for example, VFIO > > > > > > > migration does it when its initial bytes reach zero). > > > > > > > > > > > > > > In the destination, migration code will query each migration user that > > > > > > > supports precopy initial data and check if its initial data has been > > > > > > > loaded. If initial data has been loaded by all of them, an ACK will be > > > > > > > sent to the source which will now be able to complete migration when > > > > > > > appropriate. > > > > > > I can understand why this is useful, what I'm not 100% sure is whether the > > > > > > complexity is needed. The idea seems to be that src never switchover > > > > > > unless it receives a READY notification from dst. > > > > > > > > > > > > I'm imaging below simplified and more general workflow, not sure whether it > > > > > > could work for you: > > > > > > > > > > > > - Introduce a new cap "switchover-ready", it means whether there'll be a > > > > > > ready event sent from dst -> src for "being ready for switchover" > > > > > > > > > > > > - When cap set, a new msg MIG_RP_MSG_SWITCHOVER_READY is defined and > > > > > > handled on src showing that dest is ready for switchover. 
It'll be sent > > > > > > only if dest is ready for the switchover > > > > > > > > > > > > - Introduce a field SaveVMHandlers.explicit_switchover_needed. For each > > > > > > special device like vfio that would like to participate in the decision > > > > > > making, device can set its explicit_switchover_needed=1. This field is > > > > > > ignored if the new cap is not set. > > > > > > > > > > > > - Dst qemu: when new cap set, remember how many special devices are there > > > > > > requesting explicit switchover (count of SaveVMHandlers that has the > > > > > > bit set during load setup) as switch_over_pending=N. > > > > > > > > > > > > - Dst qemu: Once a device thinks its fine to switchover (probably in the > > > > > > load_state() callback), it calls migration_notify_switchover_ready(). > > > > > > That decreases switch_over_pending and when it hits zero, one msg > > > > > > MIG_RP_MSG_SWITCHOVER_READY will be sent to src. > > > > > > > > > > > > Only until READY msg received on src could src switchover the precopy to > > > > > > dst. > > > > > > > > > > > > Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1 > > > > > > more msg (dst->src). > > > > > > > > > > > > This is based on the fact that right now we always set caps on both qemus > > > > > > so I suppose it already means either both have or don't have the feature > > > > > > (even if one has, not setting the cap means disabled on both). > > > > > > > > > > > > Would it work for this case and cleaner? > > > > > Hi Peter, thanks for the response! > > > > > Your approach is indeed much simpler, however I have a few concerns > > > > > regarding compatibility. > > > > > > > > > > You are saying that caps are always set both in src and dest. > > > > > But what happens if we set the cap only on one side? > > > > > Should we care about these scenarios? > > > > I think it's not needed for now, but I am aware that this is a problem. > > > > It's just that it is a more generic problem to me rather than very special > > > > in the current feature being proposed. At least there're a few times > > > > Daniel showed concern on keeping this way and hoped we can have a better > > > > handshake in general with migration framework. > > > > > > > > I'd be perfectly fine if you want to approach this with a handshake > > > > methodology, but I hope if so we should provide a more generic handshake. > > > > So potentially that can make this new feature rely on the handshake work, > > > > and slower to get into shape. Your call on how to address this, at least > > > > fine by me either way. > > > I'd really like this feature to get in, and I'm afraid making it dependent > > > on first implementing a general migration handshake may take a long time, > > > like you said. > > > What about keeping current approach but changing it such that the capability > > > will have to be set in both src and dest, to make it similar to other > > > capability usages? > > > I.e., we will remove the "general" handshake: > > > > > > /* Enable precopy initial data generally in the migration */ > > > memset(&buf, 0, sizeof(buf)); > > > buf.general_enable = 1; > > > qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf), > > > (uint8_t *)&buf); > > > > > > but keep the per-device handshake, which is not a handshake for migration > > > capabilities, but a part of the protocol when the capability is set, like in > > > multifd, postcopy, etc. 
> > > This way we can advance with this feature while making the general migration > > > handshake an independent effort. > > > Will that work for you? > > Yes it's fine by me. > > > > > BTW, with your suggestion to add a notification mechanism to notify when > > > initial data is loaded in dest, I think we can drop these two SaveVMHandlers > > > handlers: > > > /* > > > * Checks if precopy initial data is active. If it's inactive, > > > * initial_data_loaded check is skipped. > > > */ > > > bool (*is_initial_data_active)(void *opaque); > > > /* Checks if precopy initial data has been loaded in dest */ > > > bool (*initial_data_loaded)(void *opaque); > > > > > > > In my imagination a generic handshake should happen at the very start of > > > > migration and negociate feature bits between src/dst qemu, so they can > > > > reach a consensus on what to do next. > > > > > > > > > For example, if we set the cap only in src, then src will wait indefinitely > > > > > for dest to notify that switchover is ready. > > > > > Would you expect migration to fail instead of just keep running > > > > > indefinitely? > > > > > In current approach we only need to enable the cap in the source, so such > > > > > scenario can't happen. > > > > > > > > > > Let's look at some other scenario. > > > > > Src QEMU supports explicit-switchover for device X but *not* for device Y > > > > > (i.e., src QEMU is some older version of QEMU that supports > > > > > explicit-switchover for device X but not for Y). > > > > > Dest QEMU supports explicit-switchover for device X and device Y. > > > > > The capability is set in both src and dest. > > > > > In the destination we will have switchover_pending=2 because both X and Y > > > > > support explicit-switchover. > > > > > We do migration, but switchover_pending will never reach 0 because only X > > > > > supports it in the source, so the migration will run indefinitely. > > > > > The per-device handshake solves this by making device Y not use > > > > > explicit-switchover in this case. > > > > Hmm, right. When I was replying obviously I thought that decision can be > > > > made sololy by the dest qemu, then I assumed it's fine. Because IIUC in > > > > that case how many devices that supports switchover_pending on src qemu > > > > doesn't really matter but only dest. > > > > > > > > But I re-read the last patch and I do see that there's a new bit that will > > > > change the device protocol of migration: > > > > > > > > if (migration->initial_data_active && !migration->precopy_init_size && > > > > !migration->initial_data_sent) { > > > > qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT); > > > > migration->initial_data_sent = true; > > > > } else { > > > > qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); > > > > } > > > > > > > > With this, I think what you said makes sense because then the src qemu > > > > matters on deciding whether to send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT, it > > > > also needs to make sure dst qemu will recognize it. > > > > > > > > Do you think this new VFIO_MIG_FLAG_DEV_INIT_DATA_SENT is a must to have? > > > > Can this decision be made on dest qemu only? > > > > > > > > To ask in another way, I saw that precopy_init_size is the fundation to > > > > decide whether to send this flag. Then it's a matter of whether dest qemu > > > > is also aware of precopy_init_size, then it can already tell when it's > > > > ready to handle the switchover. > > > The destination is not aware of precopy_init_size, only the source knows it. 
> > > So the source must send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT to notify dest that > > > the initial data was sent. > > Then, can the src qemu notify instead? > > > > We can have similar notification mechanism on src qemu and if that can work > > we can further same the other MIG_RP_MSG. The counter counts how many I meant s/same/save/ here.. > > special src devices are there and we don't switchover only if all agree. > > Do you mean the following: > We will have the pending counter and notification mechanism in source > instead of destination. > The MIG_RP_MSG_INITIAL_DATA_LOADED_ACK message will be sent by each device > in the destination that loaded its initial data. > Each such RP_MSG will decrease the pending counter in source. > When the pending counter in source reaches zero, source can complete > migration. > > Or did I misunderstand you? I meant the notification can come sololy from the src qemu, so src qemu can skip the switchover if any of the src relevant device hasn't yet acknowledged on the switchover. Then I think we can avoid introducing a MIG_RP msg? I'm not sure whether I missed something, though. I stated my understanding on the ordering below. > > > I know that even if !precopy_init_size on src, it doesn't mean that dest > > has already digested all the data in the send buffer. However since we'll > > anyway make sure queued data landed before switch over happens (e.g., when > > we only have 1 migration channel data are sent in sequential manner), it > > means when switchover the dst qemu should have these loaded? > > > > Thanks, > > > > -- > > Peter Xu > > >
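What Peter suggests here, with no new return-path message at all, could sit in the migration loop roughly as follows; the loop shape and helper names are illustrative stand-ins, not the real migration_thread():

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins, not real QEMU functions. */
bool migration_can_switchover(void);   /* true once all opted-in src devices notified */
uint64_t pending_precopy_bytes(void);
void migration_iterate_once(void);
void migration_completion(void);       /* performs the actual switchover */

static void migration_loop_sketch(uint64_t threshold)
{
    for (;;) {
        if (pending_precopy_bytes() <= threshold &&
            migration_can_switchover()) {
            /*
             * Every opted-in source device has reported that its initial
             * data went out; with a single in-order channel the destination
             * has already received it, so completion can proceed without
             * any dst->src ACK.
             */
            migration_completion();
            return;
        }
        migration_iterate_once();
    }
}
```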
On 08/05/2023 3:49, Peter Xu wrote: > External email: Use caution opening links or attachments > > > On Sun, May 07, 2023 at 03:54:00PM +0300, Avihai Horon wrote: >> On 04/05/2023 18:50, Peter Xu wrote: >>> External email: Use caution opening links or attachments >>> >>> >>> On Thu, May 04, 2023 at 01:18:04PM +0300, Avihai Horon wrote: >>>> On 03/05/2023 18:49, Peter Xu wrote: >>>>> External email: Use caution opening links or attachments >>>>> >>>>> >>>>> On Wed, May 03, 2023 at 06:22:59PM +0300, Avihai Horon wrote: >>>>>> On 03/05/2023 1:49, Peter Xu wrote: >>>>>>> External email: Use caution opening links or attachments >>>>>>> >>>>>>> >>>>>>> On Mon, May 01, 2023 at 05:01:33PM +0300, Avihai Horon wrote: >>>>>>>> Hello everyone, >>>>>>> Hi, Avihai, >>>>>>> >>>>>>>> === Flow of operation === >>>>>>>> >>>>>>>> To use precopy initial data, the capability must be enabled in the >>>>>>>> source. >>>>>>>> >>>>>>>> As this capability must be supported also in the destination, a >>>>>>>> handshake is performed during migration setup. The purpose of the >>>>>>>> handshake is to notify the destination that precopy initial data is used >>>>>>>> and to check if it's supported. >>>>>>>> >>>>>>>> The handshake is done in two levels. First, a general handshake is done >>>>>>>> with the destination migration code to notify that precopy initial data >>>>>>>> is used. Then, for each migration user in the source that supports >>>>>>>> precopy initial data, a handshake is done with its counterpart in the >>>>>>>> destination: >>>>>>>> If both support it, precopy initial data will be used for them. >>>>>>>> If source doesn't support it, precopy initial data will not be used for >>>>>>>> them. >>>>>>>> If source supports it and destination doesn't, migration will be failed. >>>>>>>> >>>>>>>> Assuming the handshake succeeded, migration starts to send precopy data >>>>>>>> and as part of it also the initial precopy data. Initial precopy data is >>>>>>>> just like any other precopy data and as such, migration code is not >>>>>>>> aware of it. Therefore, it's the responsibility of the migration users >>>>>>>> (such as VFIO devices) to notify their counterparts in the destination >>>>>>>> that their initial precopy data has been sent (for example, VFIO >>>>>>>> migration does it when its initial bytes reach zero). >>>>>>>> >>>>>>>> In the destination, migration code will query each migration user that >>>>>>>> supports precopy initial data and check if its initial data has been >>>>>>>> loaded. If initial data has been loaded by all of them, an ACK will be >>>>>>>> sent to the source which will now be able to complete migration when >>>>>>>> appropriate. >>>>>>> I can understand why this is useful, what I'm not 100% sure is whether the >>>>>>> complexity is needed. The idea seems to be that src never switchover >>>>>>> unless it receives a READY notification from dst. >>>>>>> >>>>>>> I'm imaging below simplified and more general workflow, not sure whether it >>>>>>> could work for you: >>>>>>> >>>>>>> - Introduce a new cap "switchover-ready", it means whether there'll be a >>>>>>> ready event sent from dst -> src for "being ready for switchover" >>>>>>> >>>>>>> - When cap set, a new msg MIG_RP_MSG_SWITCHOVER_READY is defined and >>>>>>> handled on src showing that dest is ready for switchover. It'll be sent >>>>>>> only if dest is ready for the switchover >>>>>>> >>>>>>> - Introduce a field SaveVMHandlers.explicit_switchover_needed. 
For each >>>>>>> special device like vfio that would like to participate in the decision >>>>>>> making, device can set its explicit_switchover_needed=1. This field is >>>>>>> ignored if the new cap is not set. >>>>>>> >>>>>>> - Dst qemu: when new cap set, remember how many special devices are there >>>>>>> requesting explicit switchover (count of SaveVMHandlers that has the >>>>>>> bit set during load setup) as switch_over_pending=N. >>>>>>> >>>>>>> - Dst qemu: Once a device thinks its fine to switchover (probably in the >>>>>>> load_state() callback), it calls migration_notify_switchover_ready(). >>>>>>> That decreases switch_over_pending and when it hits zero, one msg >>>>>>> MIG_RP_MSG_SWITCHOVER_READY will be sent to src. >>>>>>> >>>>>>> Only until READY msg received on src could src switchover the precopy to >>>>>>> dst. >>>>>>> >>>>>>> Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1 >>>>>>> more msg (dst->src). >>>>>>> >>>>>>> This is based on the fact that right now we always set caps on both qemus >>>>>>> so I suppose it already means either both have or don't have the feature >>>>>>> (even if one has, not setting the cap means disabled on both). >>>>>>> >>>>>>> Would it work for this case and cleaner? >>>>>> Hi Peter, thanks for the response! >>>>>> Your approach is indeed much simpler, however I have a few concerns >>>>>> regarding compatibility. >>>>>> >>>>>> You are saying that caps are always set both in src and dest. >>>>>> But what happens if we set the cap only on one side? >>>>>> Should we care about these scenarios? >>>>> I think it's not needed for now, but I am aware that this is a problem. >>>>> It's just that it is a more generic problem to me rather than very special >>>>> in the current feature being proposed. At least there're a few times >>>>> Daniel showed concern on keeping this way and hoped we can have a better >>>>> handshake in general with migration framework. >>>>> >>>>> I'd be perfectly fine if you want to approach this with a handshake >>>>> methodology, but I hope if so we should provide a more generic handshake. >>>>> So potentially that can make this new feature rely on the handshake work, >>>>> and slower to get into shape. Your call on how to address this, at least >>>>> fine by me either way. >>>> I'd really like this feature to get in, and I'm afraid making it dependent >>>> on first implementing a general migration handshake may take a long time, >>>> like you said. >>>> What about keeping current approach but changing it such that the capability >>>> will have to be set in both src and dest, to make it similar to other >>>> capability usages? >>>> I.e., we will remove the "general" handshake: >>>> >>>> /* Enable precopy initial data generally in the migration */ >>>> memset(&buf, 0, sizeof(buf)); >>>> buf.general_enable = 1; >>>> qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf), >>>> (uint8_t *)&buf); >>>> >>>> but keep the per-device handshake, which is not a handshake for migration >>>> capabilities, but a part of the protocol when the capability is set, like in >>>> multifd, postcopy, etc. >>>> This way we can advance with this feature while making the general migration >>>> handshake an independent effort. >>>> Will that work for you? >>> Yes it's fine by me. >>> >>>> BTW, with your suggestion to add a notification mechanism to notify when >>>> initial data is loaded in dest, I think we can drop these two SaveVMHandlers >>>> handlers: >>>> /* >>>> * Checks if precopy initial data is active. 
If it's inactive, >>>> * initial_data_loaded check is skipped. >>>> */ >>>> bool (*is_initial_data_active)(void *opaque); >>>> /* Checks if precopy initial data has been loaded in dest */ >>>> bool (*initial_data_loaded)(void *opaque); >>>> >>>>> In my imagination a generic handshake should happen at the very start of >>>>> migration and negociate feature bits between src/dst qemu, so they can >>>>> reach a consensus on what to do next. >>>>> >>>>>> For example, if we set the cap only in src, then src will wait indefinitely >>>>>> for dest to notify that switchover is ready. >>>>>> Would you expect migration to fail instead of just keep running >>>>>> indefinitely? >>>>>> In current approach we only need to enable the cap in the source, so such >>>>>> scenario can't happen. [1] >>>>>> Let's look at some other scenario. >>>>>> Src QEMU supports explicit-switchover for device X but *not* for device Y >>>>>> (i.e., src QEMU is some older version of QEMU that supports >>>>>> explicit-switchover for device X but not for Y). >>>>>> Dest QEMU supports explicit-switchover for device X and device Y. >>>>>> The capability is set in both src and dest. >>>>>> In the destination we will have switchover_pending=2 because both X and Y >>>>>> support explicit-switchover. >>>>>> We do migration, but switchover_pending will never reach 0 because only X >>>>>> supports it in the source, so the migration will run indefinitely. >>>>>> The per-device handshake solves this by making device Y not use >>>>>> explicit-switchover in this case. >>>>> Hmm, right. When I was replying obviously I thought that decision can be >>>>> made sololy by the dest qemu, then I assumed it's fine. Because IIUC in >>>>> that case how many devices that supports switchover_pending on src qemu >>>>> doesn't really matter but only dest. >>>>> >>>>> But I re-read the last patch and I do see that there's a new bit that will >>>>> change the device protocol of migration: >>>>> >>>>> if (migration->initial_data_active && !migration->precopy_init_size && >>>>> !migration->initial_data_sent) { >>>>> qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT); >>>>> migration->initial_data_sent = true; >>>>> } else { >>>>> qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); >>>>> } >>>>> >>>>> With this, I think what you said makes sense because then the src qemu >>>>> matters on deciding whether to send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT, it >>>>> also needs to make sure dst qemu will recognize it. >>>>> >>>>> Do you think this new VFIO_MIG_FLAG_DEV_INIT_DATA_SENT is a must to have? >>>>> Can this decision be made on dest qemu only? >>>>> >>>>> To ask in another way, I saw that precopy_init_size is the fundation to >>>>> decide whether to send this flag. Then it's a matter of whether dest qemu >>>>> is also aware of precopy_init_size, then it can already tell when it's >>>>> ready to handle the switchover. >>>> The destination is not aware of precopy_init_size, only the source knows it. >>>> So the source must send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT to notify dest that >>>> the initial data was sent. >>> Then, can the src qemu notify instead? >>> >>> We can have similar notification mechanism on src qemu and if that can work >>> we can further same the other MIG_RP_MSG. The counter counts how many > I meant s/same/save/ here.. > >>> special src devices are there and we don't switchover only if all agree. >> Do you mean the following: >> We will have the pending counter and notification mechanism in source >> instead of destination. 
>> The MIG_RP_MSG_INITIAL_DATA_LOADED_ACK message will be sent by each device
>> in the destination that loaded its initial data.
>> Each such RP_MSG will decrease the pending counter in source.
>> When the pending counter in source reaches zero, source can complete
>> migration.
>>
>> Or did I misunderstand you?
> I meant the notification can come sololy from the src qemu, so src qemu can
> skip the switchover if any of the src relevant device hasn't yet
> acknowledged on the switchover.
>
> Then I think we can avoid introducing a MIG_RP msg?  I'm not sure whether
> I missed something, though.  I stated my understanding on the ordering below.

Oh, I see what you mean now.
However, I don't think it will work -- knowing that the source has sent the
initial data alone is not enough.  We must get an ACK from the destination
that it processed the initial data, as processing it can take time.

For example:
- We have one device that supports precopy initial data.
- Loading the initial data takes 2 seconds.

The device in the source just finished sending the initial data, so it
notifies that it can do switchover.  The source sees that the device has
acked the switchover, plus the general pending precopy data is below the
threshold, so it decides to stop the VM and complete migration.
However, the destination is still processing the initial precopy data, which
takes 2 seconds, during which the destination is blocked and can't process
new migration data and complete the migration.  The source VM is already
stopped, so these 2 seconds are spent during downtime.

What I suggested above, of having the notification mechanism in the source
and sending a MIG_RP message for each device in the destination that loaded
its initial data, solves the problem in the aforementioned scenario ([1]
above).  That's because switchover_pending will be 1 in the source and will
reach zero when device X in the destination acks that its initial data is
loaded, so migration could be completed.
With that, we only need to add a single new MIG_RP message and the
initial_data_advise() SaveVMHandlers handler.

However, I'm not sure this will actually work.  Supporting initial data may
require special handling/preparation.  Device Y in the destination may
expect to get initial data, but device Y in the source will not send it
since it doesn't support it.  Depending on the device implementation, this
may fail the migration.
The per-device advise can solve this by syncing between the devices in
source and destination.  For example, device Y in the destination will not
get the advise MIG_CMD and thus will know not to expect initial data.
Plus, in case a device in the source supports initial data but its
counterpart in the destination doesn't, migration will fail during setup and
not only when the destination device gets unexpected initial data.  This
gives a clearer and more consistent behavior.

This will add some complexity of sending a new advise MIG_CMD per device in
the source and handling it in the destination, but I still tend to include
the per-device advise for the reasons above.

What's your opinion about it?

Thanks.

>>> I know that even if !precopy_init_size on src, it doesn't mean that dest
>>> has already digested all the data in the send buffer.  However since we'll
>>> anyway make sure queued data landed before switch over happens (e.g., when
>>> we only have 1 migration channel data are sent in sequential manner), it
>>> means when switchover the dst qemu should have these loaded?
>>>
>>> Thanks,
>>>
>>> --
>>> Peter Xu
>>>
> --
> Peter Xu
>
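For illustration, the source-side variant being discussed boils down to
something like the sketch below.  The function names are hypothetical and
not part of the posted patches; the point is only that the pending counter
lives on the source and one MIG_RP_MSG_INITIAL_DATA_LOADED_ACK from the
destination decrements it:

#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch only -- names are made up, not the posted patches. */

static unsigned int switchover_pending;

/* Source, setup: count each device that negotiated precopy initial data. */
void precopy_initial_data_register_device(void)
{
    switchover_pending++;
}

/* Source, return-path thread: one MIG_RP_MSG_INITIAL_DATA_LOADED_ACK
 * arrives per device whose initial data was fully loaded in dest. */
void migrate_handle_rp_initial_data_loaded_ack(void)
{
    assert(switchover_pending > 0);
    switchover_pending--;
}

/* Source, migration thread: checked in addition to the usual
 * pending-bytes threshold before stopping the VM for switchover. */
bool precopy_initial_data_can_switchover(void)
{
    return switchover_pending == 0;
}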
Avihai Horon <avihaih@nvidia.com> wrote:
> On 03/05/2023 1:49, Peter Xu wrote:
>> External email: Use caution opening links or attachments
>> Only until READY msg received on src could src switchover the precopy to
>> dst.
>>
>> Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1
>> more msg (dst->src).
>>
>> This is based on the fact that right now we always set caps on both qemus
>> so I suppose it already means either both have or don't have the feature
>> (even if one has, not setting the cap means disabled on both).
>>
>> Would it work for this case and cleaner?
>
> Hi Peter, thanks for the response!
> Your approach is indeed much simpler, however I have a few concerns
> regarding compatibility.
>
> You are saying that caps are always set both in src and dest.
> But what happens if we set the cap only on one side?
> Should we care about these scenarios?

Not really.
We assume that something like libvirt has set things up and that such
things are ok.  We don't try to detect that kind of thing in the migration
stream (I am not saying we shouldn't, just that we don't).

If you configure qemu with a disk that is on the source but not on the
destination, migration will work fine until the device copy stage, when
that disk is missing.  I think this is something like that: a
misconfiguration.

> For example, if we set the cap only in src, then src will wait
> indefinitely for dest to notify that switchover is ready.
> Would you expect migration to fail instead of just keep running
> indefinitely?
> In current approach we only need to enable the cap in the source, so
> such scenario can't happen.

I see.  I have to think about whether this is a better approach.  But I
would like to know what libvirt thinks about this.

Daniel?

> Let's look at some other scenario.
> Src QEMU supports explicit-switchover for device X but *not* for
> device Y (i.e., src QEMU is some older version of QEMU that supports
> explicit-switchover for device X but not for Y).
> Dest QEMU supports explicit-switchover for device X and device Y.
> The capability is set in both src and dest.
> In the destination we will have switchover_pending=2 because both X
> and Y support explicit-switchover.
> We do migration, but switchover_pending will never reach 0 because
> only X supports it in the source, so the migration will run
> indefinitely.
> The per-device handshake solves this by making device Y not use
> explicit-switchover in this case.

You have a point here.
But I will approach this case in a different way:

Destination QEMU needs to be older, because it doesn't have the feature.
So we need to NOT be able to do the switchover for older machine types.
And have something like this in qemu/hw/machine.c:

GlobalProperty hw_compat_7_2[] = {
    { "our_device", "explicit-switchover", "off" },
};

Or whatever we want to call the device and the property, and not use it
for older machine types, to allow migration for that.

That said, this is the "ideal" world.  In general we don't force this
because we are not good at detecting these kinds of failures.

Later, Juan.
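To make the suggestion concrete: a hw_compat entry like the one above only
has an effect if the device exposes a matching boolean property that it
consults at setup time.  A rough sketch of what that could look like (the
device, struct and field names are hypothetical; DEFINE_PROP_BOOL and
DEFINE_PROP_END_OF_LIST are the usual QEMU property macros):

/* Hypothetical device state -- only the property wiring is sketched. */
typedef struct OurDeviceState {
    DeviceState parent_obj;
    bool explicit_switchover;   /* on by default, off for old machine types */
} OurDeviceState;

static Property our_device_properties[] = {
    DEFINE_PROP_BOOL("explicit-switchover", OurDeviceState,
                     explicit_switchover, true),
    DEFINE_PROP_END_OF_LIST(),
};

/* The device's save_setup()/load_setup() would only opt in to the new
 * protocol when the property is still on, so a guest started with an old
 * machine type migrates with the old protocol on both sides. */
static bool our_device_uses_explicit_switchover(OurDeviceState *s)
{
    return s->explicit_switchover;
}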
On 10/05/2023 12:12, Juan Quintela wrote: > External email: Use caution opening links or attachments > > > Avihai Horon <avihaih@nvidia.com> wrote: >> On 03/05/2023 1:49, Peter Xu wrote: >>> External email: Use caution opening links or attachments >>> Only until READY msg received on src could src switchover the precopy to >>> dst. >>> >>> Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1 >>> more msg (dst->src). >>> >>> This is based on the fact that right now we always set caps on both qemus >>> so I suppose it already means either both have or don't have the feature >>> (even if one has, not setting the cap means disabled on both). >>> >>> Would it work for this case and cleaner? >> Hi Peter, thanks for the response! >> Your approach is indeed much simpler, however I have a few concerns >> regarding compatibility. >> >> You are saying that caps are always set both in src and dest. >> But what happens if we set the cap only on one side? >> Should we care about these scenarios? > Not really. > We are supposed that something like libvirt has set things up and such > things are ok. We don't try to detect that kind of things in the > migration stream (I am not telling we should'nt, but that we don't). > > If you configure qemu with an disk on source that is on source but not > on destination, migration will work fine until the device copy stage, > when that disk is missing. I think this is something like that. A > missconfiguration. Ah, I understand. > >> For example, if we set the cap only in src, then src will wait >> indefinitely for dest to notify that switchover is ready. >> Would you expect migration to fail instead of just keep running >> indefinitely? >> In current approach we only need to enable the cap in the source, so >> such scenario can't happen. > I see. I have to think if this is a better approach. But will like to > know what libvirt thinks about this. > > Daniel? > > >> Let's look at some other scenario. >> Src QEMU supports explicit-switchover for device X but *not* for >> device Y (i.e., src QEMU is some older version of QEMU that supports >> explicit-switchover for device X but not for Y). >> Dest QEMU supports explicit-switchover for device X and device Y. >> The capability is set in both src and dest. >> In the destination we will have switchover_pending=2 because both X >> and Y support explicit-switchover. >> We do migration, but switchover_pending will never reach 0 because >> only X supports it in the source, so the migration will run >> indefinitely. >> The per-device handshake solves this by making device Y not use >> explicit-switchover in this case. > You have a point here. > But I will approach this case in a different way: > > Destination QEMU needs to be older, because it don't have the feature. > So we need to NOT being able to do the switchover for older machine > types. > And have something like this is qemu/hw/machine.c > > GlobalProperty hw_compat_7_2[] = { > { "our_device", "explicit-switchover", "off" }, > }; > > Or whatever we want to call the device and the property, and not use it > for older machine types to allow migration for that. Let me see if I get this straight (I'm not that familiar with hw_compat_x_y): You mean that device Y which adds support for explicit-switchover in QEMU version Z should add a property like you wrote above, and use it to disable explicit-switchover usage for Y devices when Y device from QEMU older than Z is migrated? Thanks. > > Once told that, this is the "ideal" world. 
In general we don't force > this because we are not good at detecting this kind of failures. > > Later, Juan. >
Avihai Horon <avihaih@nvidia.com> wrote:
>> You have a point here.
>> But I will approach this case in a different way:
>>
>> Destination QEMU needs to be older, because it don't have the feature.
>> So we need to NOT being able to do the switchover for older machine
>> types.
>> And have something like this is qemu/hw/machine.c
>>
>> GlobalProperty hw_compat_7_2[] = {
>>     { "our_device", "explicit-switchover", "off" },
>> };
>>
>> Or whatever we want to call the device and the property, and not use it
>> for older machine types to allow migration for that.
>
> Let me see if I get this straight (I'm not that familiar with
> hw_compat_x_y):
>
> You mean that device Y which adds support for explicit-switchover in
> QEMU version Z should add a property
> like you wrote above, and use it to disable explicit-switchover usage
> for Y devices when Y device
> from QEMU older than Z is migrated?

More than "from", also "to".

Let me elaborate.  We have two QEMUs:

QEMU version X, has device dev.  Let's call it qemu-X.
QEMU version Y (X+1) adds feature foo to device dev.  Let's call it qemu-Y.

We have two machine types (for this exercise we don't care about
architectures):

PC-X.0
PC-Y.0

So, the possible combinations are:

First the easy cases: same qemu on both sides, different machine types.

$ qemu-X -M PC-X.0 -> to -> qemu-X -M PC-X.0

good.  Neither side uses feature foo.

$ qemu-X -M PC-Y.0 -> to -> qemu-X -M PC-Y.0

impossible.  qemu-X doesn't have machine PC-Y.0.  So nothing to see here.

$ qemu-Y -M PC-X.0 -> to -> qemu-Y -M PC-X.0

good.  We have feature foo on both sides.  Notice that I recommend here
not to use feature foo.  We will see why in the difficult cases.

$ qemu-Y -M PC-Y.0 -> to -> qemu-Y -M PC-Y.0

good.  Both sides use feature foo.

Difficult cases, when we mix qemu versions.

$ qemu-X -M PC-X.0 -> to -> qemu-Y -M PC-X.0

source doesn't have feature foo.  Destination has feature foo.
But if we disable it for machine PC-X.0, it will work.

$ qemu-Y -M PC-X.0 -> to -> qemu-X -M PC-X.0

same as the previous example.  But here we have feature foo on the source
and not on the destination.  Disabling it for machine PC-X.0 fixes the
problem.

We can't migrate a PC-Y.0 when one of the qemus is qemu-X, so that case
is impossible.

Does this make more sense?

And now, how hw_compat_X_Y works.

It is an array of entries with the format:

- name of device (we give some rope here, for instance migration is a
  device in this context)

- name of property: self explanatory.  The important bit is that
  we can get the value of the property in the device driver.

- value of the property: self explanatory.

With this mechanism, what we do when we add a feature to a device that
matters for migration is:

- for the machine type of the version that we are "developing", the feature
  is enabled by default.  For whatever "enable" means.

- for old machine types we disable the feature, so it can migrate
  freely with old qemu.  But using the old machine type.

- there is a way to enable the feature on the command line even for old
  machine types on new qemu, but only developers use that for testing.
  Normal users/admins never do that.

What does hw_compat_7_2 mean?

Ok, we need to know the versions.  The new version is 8.0.
hw_compat_7_2 has all the properties representing "features", defaults,
whatever has changed since 7.2.  In other words, what features we need to
disable to get back to the features that existed when 7.2 was released.

To go to a real example: in the development tree we have:

GlobalProperty hw_compat_8_0[] = {
    { "migration", "multifd-flush-after-each-section", "on"},
};
const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);

The feature is implemented in the following commits:

77c259a4cb1c9799754b48f570301ebf1de5ded8
b05292c237030343516d073b1a1e5f49ffc017a8
294e5a4034e81b3d8db03b4e0f691386f20d6ed3

When we are doing migration with multifd and we pass the end of memory
(i.e. we end one iteration through all the RAM), we need to make sure that
we don't send the same page through two channels, i.e. the contents of the
page at iteration 1 through channel 1 and the contents of the page at
iteration 2 through channel 2.  The problem is that they could arrive out
of order, and the page of iteration 1 could arrive later than iteration 2
and overwrite new data with old data.  Which is undesirable.

We could use complex algorithms to fix that, but one easy way of doing it is:

- When we finish a run through all memory (i.e. one iteration), we flush
  all channels and make sure that everything arrives at the destination
  before starting to send data of the next iteration.  I call that
  synchronizing all channels.

And that is what we *should* have done.  But when I implemented the
feature, I did this synchronization every time we finish a cycle (around
100 milliseconds), i.e. 10 times per second.  This is called a section for
historical reasons.  And when you are migrating multi-terabyte RAM machines
with very fast networking, we end up waiting too much on the
synchronizations.

Once we detected the problem and found the cause, we changed that.  The
problem is that if we are running an old qemu against a new qemu (or vice
versa) we would not be able to migrate, because one side sends/expects
synchronizations at different points.

So we have to maintain the old algorithm and the new algorithm.  That is
what we did here.  For machine types older than <current in development>,
i.e. 8.0, we use the old algorithm (multifd-flush-after-each-section is
"on").  But the default for new machine types is the new algorithm, much
faster.

I know that the explanation has been quite long, but inventing an example
was going to be even more complex.

Does this make sense?

Later, Juan.
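A rough sketch of the difference between the two behaviors follows.  It is
illustrative only: the helper names are made up (not the actual ram.c
functions), the flag values are placeholders, and error handling is
omitted; it just shows where the multifd sync happens in each mode:

#include <stdint.h>

/* Illustrative sketch, not the real QEMU code. */
#define RAM_SAVE_FLAG_EOS           0x10    /* value illustrative */
#define RAM_SAVE_FLAG_MULTIFD_FLUSH 0x200   /* value illustrative */

void multifd_sync_all_channels(void);       /* hypothetical helper */
void stream_put_flag(uint64_t flag);        /* hypothetical helper */

/* Old behavior (multifd-flush-after-each-section=on, old machine types):
 * sync at the end of every ~100ms section, i.e. ~10 times per second. */
static void end_of_section(void)
{
    multifd_sync_all_channels();
    stream_put_flag(RAM_SAVE_FLAG_EOS);
}

/* New behavior (default for new machine types): only sync when a full pass
 * over guest RAM completes, which is all that is needed to keep old and new
 * copies of the same page from racing on different channels. */
static void end_of_ram_iteration(void)
{
    multifd_sync_all_channels();
    stream_put_flag(RAM_SAVE_FLAG_MULTIFD_FLUSH);
}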
On 10/05/2023 19:41, Juan Quintela wrote: > External email: Use caution opening links or attachments > > > Avihai Horon <avihaih@nvidia.com> wrote: > >>> You have a point here. >>> But I will approach this case in a different way: >>> >>> Destination QEMU needs to be older, because it don't have the feature. >>> So we need to NOT being able to do the switchover for older machine >>> types. >>> And have something like this is qemu/hw/machine.c >>> >>> GlobalProperty hw_compat_7_2[] = { >>> { "our_device", "explicit-switchover", "off" }, >>> }; >>> >>> Or whatever we want to call the device and the property, and not use it >>> for older machine types to allow migration for that. >> Let me see if I get this straight (I'm not that familiar with >> hw_compat_x_y): >> >> You mean that device Y which adds support for explicit-switchover in >> QEMU version Z should add a property >> like you wrote above, and use it to disable explicit-switchover usage >> for Y devices when Y device >> from QEMU older than Z is migrated? > More that "from" "to" > > Let me elaborate. We have two QEMUs: > > QEMU version X, has device dev. Let's call it qemu-X. > QEMU version Y (X+1) add feature foo to device dev. Let's call it qemu-Y. > > We have two machine types (for this exercise we don't care about > architectures) > > PC-X.0 > PC-Y.0 > > So, the possible combinations are: > > First the easy cases, same qemu on both sides. Different machine types. > > $ qemu-X -M PC-X.0 -> to -> qemu-X -M PC-X.0 > > good. neither guest use feature foo. > > $ qemu-X -M PC-Y.0 -> to -> qemu-X -M PC-Y.0 > > impossible. qemu-X don't have machine PC-Y.0. So nothing to see here. > > $ qemu-Y -M PC-X.0 -> to -> qemu-Y -M PC-X.0 > > good. We have feature foo in both sides. Notice that I recomend here > to not use feature foo. We will see on the difficult cases. > > $ qemu-Y -M PC-Y.0 -> to -> qemu-Y -M PC-Y.0 > > good. Both sides use feature foo. > > Difficult cases, when we mix qemu versions. > > $ qemu-X -M PC-X.0 -> to -> qemu-Y -M PC-X.0 > > source don't have feature foo. Destination have feature foo. > But if we disable it for machine PC-X.0, it will work. > > $ qemu-Y -M PC-X.0 -> to -> qemu-X -M PC-X.0 > > same than previous example. But here we have feature foo on source > and not on destination. Disabling it for machine PC-X.0 fixes the > problem. > > We can't migrate a PC-Y.0 when one of the qemu's is qemu-X, so that case > is impossible. > > Does this makes more sense? > > And now, how hw_compat_X_Y works. > > It is an array of registers with the format: > > - name of device (we give some rope here, for instance migration is a > device in this context) > > - name of property: self explanatory. The important bit is that > we can get the value of the property in the device driver. > > - value of the property: self explanatory. > > With this mechanism what we do when we add a feature to a device that > matters for migration is: > - for the machine type of the version that we are "developing" feature > is enabled by default. For whatever that enable means. > > - for old machine types we disable the feature, so it can be migrate > freely with old qemu. But using the old machine type. > > - there is way to enable the feature on the command line even for old > machine types on new qemu, but only developers use that for testing. > Normal users/admins never do that. > > what does hw_compat_7_2 means? > > Ok, we need to know the versions. New version is 8.0. 
> > hw_compat_7_2 has all the properties represensing "features", defaults, > whatever that has changed since 7.2. In other words, what features we > need to disable to get to the features that existed when 7.2 was > released. > > To go to a real example. > > In the development tree. We have: > > GlobalProperty hw_compat_8_0[] = { > { "migration", "multifd-flush-after-each-section", "on"}, > }; > const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0); > > Feature is implemented in the following commits: > > 77c259a4cb1c9799754b48f570301ebf1de5ded8 > b05292c237030343516d073b1a1e5f49ffc017a8 > 294e5a4034e81b3d8db03b4e0f691386f20d6ed3 > > When we are doing migration with multifd and we pass the end of memory > (i.e. we end one iteration through all the RAM) we need to make sure > that we don't send the same page through two channels, i.e. contents of > the page at iteration 1 through channel 1 and contents of the page at > iteration 2 through channel 2. The problem is that they could arrive > out of order and the of page of iteration 1 arrive later than iteration > 2 and overwrite new data with old data. Which is undesirable. > We could use complex algorithms to fix that, but one easy way of doing > it is: > > - When we finish a run through all memory (i.e.) one iteration, we flush > all channels and make sure that everything arrives to destination > before starting sending data o the next iteration. I call that > synchronize all channels. > > And that is what we *should* have done. But when I implemented the > feature, I did this synchronization everytime that we finish a cycle > (around 100miliseconds). i.e. 10 times per second. This is called a > section for historical reasons. And when you are migrating > multiterabytes RAM machines with very fast networking, we end waiting > too much on the synchronizations. > > Once detected the problem and found the cause, we change that. The > problem is that if we are running an old qemu against a new qemu (or > viceversa) we would not be able to migrate, because one send/expects > synchronizations at different points. > > So we have to maintain the old algorithm and the new algoritm. That is > what we did here. For machines older than <current in development>, > i.e. 8.0 we use the old algorithm (multifd-flush-after-earch section is > "on"). > > But the default for new machine types is the new algorithm, much faster. > > I know that the explanation has been quite long, but inventing an > example was going to be even more complex. > > Does this makes sense? Yes, thanks a lot for the full and detailed explanation! This indeed solves the problem in the scenario I mentioned above. However, this relies on the fact that a device support for this feature depends only on the QEMU version. This is not the case for VFIO devices. To support explicit-switchover, a VFIO device also needs host kernel support for VFIO precopy, i.e., it needs to have the VFIO_MIGRATION_PRE_COPY flag set. So, theoretically we could have the following: - Source and destination QEMU are the same version. - We migrate two different VFIO devices (i.e., they don't share the same kernel driver), device X and device Y. - Host kernel in source supports VIFO precopy for device X but not for device Y. - Host kernel in destination supports VFIO precopy for both device X and device Y. Without explicit-switchover, migration should work. 
But if we enable explicit-switchover and do migration, we would end up in the
same situation where switchover_pending=2 in the destination and it never
reaches zero, so migration is stuck.

This could be solved by moving the switchover_pending counter to the source
and sending multiple MIG_RP explicit-switchover ACK messages.  However, I
also raised a concern about this in my last mail to Peter [1]: it is not
guaranteed to work, depending on how the device implements the
explicit-switchover feature.

Not sure though if I'm digging too deep into some improbable future corner
cases.

Let's go back to the basic question, which is whether we need to send an
"advise" message for each device that supports explicit-switchover.
I think it gives us more flexibility and, although not needed at the moment,
might be useful in the future.

If you want, I can send a v2 that addresses the comments and simplifies the
code in other areas, and we'll continue discussing the necessity of the
"advise" message then.

Thanks!

[1] https://lore.kernel.org/qemu-devel/688acb4e-a4e6-428d-9124-7596e3666133@nvidia.com/
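For reference, the host-kernel dependency mentioned above is a userspace
probe of the device's migration flags.  A sketch of such a check is below;
it assumes a <linux/vfio.h> recent enough to define VFIO_MIGRATION_PRE_COPY,
the helper name is made up, and error handling is trimmed:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Sketch: does this VFIO device's kernel driver report precopy support? */
static bool vfio_device_supports_precopy(int device_fd)
{
    uint64_t buf[(sizeof(struct vfio_device_feature) +
                  sizeof(struct vfio_device_feature_migration) +
                  sizeof(uint64_t) - 1) / sizeof(uint64_t)];
    struct vfio_device_feature *feature = (void *)buf;
    struct vfio_device_feature_migration *mig = (void *)feature->data;

    memset(buf, 0, sizeof(buf));
    feature->argsz = sizeof(buf);
    feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIGRATION;

    if (ioctl(device_fd, VFIO_DEVICE_FEATURE, feature)) {
        return false;   /* no migration support reported at all */
    }

    return mig->flags & VFIO_MIGRATION_PRE_COPY;
}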
Avihai Horon <avihaih@nvidia.com> wrote:
> On 10/05/2023 19:41, Juan Quintela wrote:
>> Does this makes sense?
>
> Yes, thanks a lot for the full and detailed explanation!

Thank you.

> This indeed solves the problem in the scenario I mentioned above.
>
> However, this relies on the fact that a device support for this
> feature depends only on the QEMU version.
> This is not the case for VFIO devices.

What a surprise :-)

Yes, I couldn't resist.

> To support explicit-switchover, a VFIO device also needs host kernel
> support for VFIO precopy, i.e., it needs to have the
> VFIO_MIGRATION_PRE_COPY flag set.
> So, theoretically we could have the following:
> - Source and destination QEMU are the same version.
> - We migrate two different VFIO devices (i.e., they don't share the
> same kernel driver), device X and device Y.
> - Host kernel in source supports VFIO precopy for device X but not for
> device Y.
> - Host kernel in destination supports VFIO precopy for both device X
> and device Y.
> Without explicit-switchover, migration should work.
> But if we enable explicit-switchover and do migration, we would end up
> in the same situation where switchover_pending=2 in destination and it
> never reaches zero so migration is stuck.

I think this is too much for qemu.  You need to work at the
libvirt/management level.

> This could be solved by moving the switchover_pending counter to the
> source and sending multiple MIG_RP explicit-switchover ACK messages.
> However, I also raised a concern about this in my last mail to Peter
> [1], where this is not guaranteed to work, depending on the device
> implementation for explicit-switchover feature.

I will not try to be extra clever here.  We have taken QEMU versions out of
the question, as it is the same qemu on both sides.

So what we have is this configuration:

Host A
------
device X explicit_switchover=on
device Y explicit_switchover=off

Host B
------
device X explicit_switchover=on
device Y explicit_switchover=on

The configuration is different.  That is something the qemu migration
protocol doesn't know how to handle, and it is up to the stack.

You need to configure it explicitly on the qemu command line on host B:
device=Y,explicit_switchover=off, or however that gets configured off.

It is exactly the same problem as:

Host A
------
Intel CPU genX

Host B
------
Intel CPU genX-1

i.e. there are features that Host A has but Host B doesn't have.  The only
way to make this work is that you need to configure qemu when launched on
Host A with a cpu type that Host B is able to run (i.e. one that doesn't
have any features that Host B is missing).

What is the difference between this and yours?

> Not sure though if I'm digging too deep in some improbable future
> corner cases.

Oh, you are just starting.  Think of the compat layers that CPUs have had
to grow over the years.  At some point even migration between AMD and
Intel CPUs worked.

> Let's go back to the basic question, which is whether we need to send
> an "advise" message for each device that supports explicit-switchover.
> I think it gives us more flexibility and although not needed at the
> moment, might be useful in the future.

I think that is not a good idea, see my previous comment.  We have two
cases:
- both devices have the same features in both places
- they have different features in any of the places

First case, we don't care.  It always works.
Second case, we need to configure it correctly, and that means disabling
features that are not on the other side.
> If you want I can send a v2 that addresses the comments and simplifies
> the code in other areas and we'll continue discussing the necessity of
> the "advise" message then.

Yeap.  I think it is the best course of action.

Thanks, Juan.

> Thanks!
>
> [1]
> https://lore.kernel.org/qemu-devel/688acb4e-a4e6-428d-9124-7596e3666133@nvidia.com/
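As a purely hypothetical example of the explicit per-device configuration
described above (the "explicit-switchover" device property does not exist in
these patches; device IDs and PCI addresses are invented), host B would
disable the feature for device Y so that both sides match:

# Host A: kernel supports VFIO precopy for X only -- nothing special needed.
qemu-system-x86_64 ... \
    -device vfio-pci,host=0000:3b:00.0,id=devX \
    -device vfio-pci,host=0000:3b:00.1,id=devY

# Host B: kernel supports it for both devices, so turn it off for Y
# explicitly ("explicit-switchover" is a hypothetical property name).
qemu-system-x86_64 ... \
    -device vfio-pci,host=0000:3b:00.0,id=devX \
    -device vfio-pci,host=0000:3b:00.1,id=devY,explicit-switchover=off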
On 11/05/2023 16:09, Juan Quintela wrote: > External email: Use caution opening links or attachments > > > Avihai Horon <avihaih@nvidia.com> wrote: >> On 10/05/2023 19:41, Juan Quintela wrote: >>> Does this makes sense? >> Yes, thanks a lot for the full and detailed explanation! > Thank you. > >> This indeed solves the problem in the scenario I mentioned above. >> >> However, this relies on the fact that a device support for this >> feature depends only on the QEMU version. >> This is not the case for VFIO devices. > What a surprise :-) > > Yes, I couldn't resist. > >> To support explicit-switchover, a VFIO device also needs host kernel >> support for VFIO precopy, i.e., it needs to have the >> VFIO_MIGRATION_PRE_COPY flag set. >> So, theoretically we could have the following: >> - Source and destination QEMU are the same version. >> - We migrate two different VFIO devices (i.e., they don't share the >> same kernel driver), device X and device Y. >> - Host kernel in source supports VIFO precopy for device X but not for >> device Y. >> - Host kernel in destination supports VFIO precopy for both device X >> and device Y. >> Without explicit-switchover, migration should work. >> But if we enable explicit-switchover and do migration, we would end up >> in the same situation where switchover_pending=2 in destination and it >> never reaches zero so migration is stuck. > I think this is too much for qemu. You need to work at the > libvirt/management level. > >> This could be solved by moving the switchover_pending counter to the >> source and sending multiple MIG_RP explicit-switchover ACK messages. >> However, I also raised a concern about this in my last mail to Peter >> [1], where this is not guaranteed to work, depending on the device >> implementation for explicit-switchover feature. > I will not try to be extra clever here. We have removed qemu support of > the question, as it is the same qemu in both sides. > > So what we have is this configuration: > > Host A > ------ > device X explicit_switchoever=on > device Y explicit_switchoever=off > > Host B > ------ > device X explicit_switchoever=on > device Y explicit_switchoever=on > > The configuration is different. That is something that qemu protocol > don't know how to handle, and it is up to stack. > > You need to configure explicitely in qemu command line on host B: > device=Y,explicit_switchover=off > > Or whatever is that configured off. I understand. > > It is exactly the same problem than: > > Host A > ------ > > Intel CPU genX > > Host B > ------ > > intel CPU genX-1 > > i.e. there are features that Host A has but host B don't have. The only > way to make this work is that you need to configure qemu when launched > in Host A with a cpu type that host B is able to run (i.e. one that > don't have any features that Host B is missing). > > What is the difference between this and yours? Hmm, yes, I see your point. > > >> Not sure though if I'm digging too deep in some improbable future >> corner cases. > Oh, you are just starting. The compat layers that CPU have had to do > over the years. At some point even migration between AMD and Intel > CPU's worked. > >> Let's go back to the basic question, which is whether we need to send >> an "advise" message for each device that supports explicit-switchover. >> I think it gives us more flexibility and although not needed at the >> moment, might be useful in the future. > I think that is not a good idea, see my previous comment. 
> We have two cases:
> - both devices have the same features in both places
> - they have different features in any of the places
>
> First case, we don't care.  It always work.
> Second case, we need to configure it correctly, and that means disable
> features that are not on the other side.

Yep, I understand.

>
>> If you want I can send a v2 that addresses the comments and simplifies
>> the code in other areas and we'll continue discussing the necessity of
>> the "advise" message then.
> Yeap.  I think is the best course of action.

OK, so let me digest all the new info of this discussion and get back with
v2 / conclusions / questions.

Thanks for all the help!