Message ID | 20220920223800.47467-2-peterx@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | migration: Bug fixes (prepare for preempt-full) | expand |
* Peter Xu (peterx@redhat.com) wrote: > When starting ram saving procedure (especially at the completion phase), > always set last_seen_block to non-NULL to make sure we can always correctly > detect the case where "we've migrated all the dirty pages". > > Then we'll guarantee both last_seen_block and pss.block will be valid > always before the loop starts. > > See the comment in the code for some details. > > Signed-off-by: Peter Xu <peterx@redhat.com> Yeh I guess it can currently only happen during restart? Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> > --- > migration/ram.c | 16 ++++++++++++---- > 1 file changed, 12 insertions(+), 4 deletions(-) > > diff --git a/migration/ram.c b/migration/ram.c > index dc1de9ddbc..1d42414ecc 100644 > --- a/migration/ram.c > +++ b/migration/ram.c > @@ -2546,14 +2546,22 @@ static int ram_find_and_save_block(RAMState *rs) > return pages; > } > > + /* > + * Always keep last_seen_block/last_page valid during this procedure, > + * because find_dirty_block() relies on these values (e.g., we compare > + * last_seen_block with pss.block to see whether we searched all the > + * ramblocks) to detect the completion of migration. Having NULL value > + * of last_seen_block can conditionally cause below loop to run forever. > + */ > + if (!rs->last_seen_block) { > + rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks); > + rs->last_page = 0; > + } > + > pss.block = rs->last_seen_block; > pss.page = rs->last_page; > pss.complete_round = false; > > - if (!pss.block) { > - pss.block = QLIST_FIRST_RCU(&ram_list.blocks); > - } > - > do { > again = true; > found = get_queued_page(rs, &pss); > -- > 2.32.0 >
On Thu, Sep 22, 2022 at 03:49:38PM +0100, Dr. David Alan Gilbert wrote: > * Peter Xu (peterx@redhat.com) wrote: > > When starting ram saving procedure (especially at the completion phase), > > always set last_seen_block to non-NULL to make sure we can always correctly > > detect the case where "we've migrated all the dirty pages". > > > > Then we'll guarantee both last_seen_block and pss.block will be valid > > always before the loop starts. > > > > See the comment in the code for some details. > > > > Signed-off-by: Peter Xu <peterx@redhat.com> > > Yeh I guess it can currently only happen during restart? There're only two places to clear last_seen_block: ram_state_reset[2683] rs->last_seen_block = NULL; ram_postcopy_send_discard_bitmap[2876] rs->last_seen_block = NULL; Where for the reset case: ram_state_init[2994] ram_state_reset(*rsp); ram_state_resume_prepare[3110] ram_state_reset(rs); ram_save_iterate[3271] ram_state_reset(rs); So I think it can at least happen in two places, either (1) postcopy just started (assume when postcopy starts accidentally when all dirty pages were migrated?), or (2) postcopy recover from failure. In my case I triggered this deadloop when I was debugging the other bug fixed by the next patch where it was postcopy recovery (on tls), but only once.. So currently I'm still not 100% sure whether this is the same problem, but logically it could trigger. I also remember I used to hit very rare deadloops before too, maybe they're the same thing because I did test recovery a lot. > > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Thanks!
* Peter Xu (peterx@redhat.com) wrote: > On Thu, Sep 22, 2022 at 03:49:38PM +0100, Dr. David Alan Gilbert wrote: > > * Peter Xu (peterx@redhat.com) wrote: > > > When starting ram saving procedure (especially at the completion phase), > > > always set last_seen_block to non-NULL to make sure we can always correctly > > > detect the case where "we've migrated all the dirty pages". > > > > > > Then we'll guarantee both last_seen_block and pss.block will be valid > > > always before the loop starts. > > > > > > See the comment in the code for some details. > > > > > > Signed-off-by: Peter Xu <peterx@redhat.com> > > > > Yeh I guess it can currently only happen during restart? > > There're only two places to clear last_seen_block: > > ram_state_reset[2683] rs->last_seen_block = NULL; > ram_postcopy_send_discard_bitmap[2876] rs->last_seen_block = NULL; > > Where for the reset case: > > ram_state_init[2994] ram_state_reset(*rsp); > ram_state_resume_prepare[3110] ram_state_reset(rs); > ram_save_iterate[3271] ram_state_reset(rs); > > So I think it can at least happen in two places, either (1) postcopy just > started (assume when postcopy starts accidentally when all dirty pages were > migrated?), or (2) postcopy recover from failure. Oh, (1) is a more general problem then; yeh. > In my case I triggered this deadloop when I was debugging the other bug > fixed by the next patch where it was postcopy recovery (on tls), but only > once.. So currently I'm still not 100% sure whether this is the same > problem, but logically it could trigger. > > I also remember I used to hit very rare deadloops before too, maybe they're > the same thing because I did test recovery a lot. Note; 'deadlock' not 'deadloop'. Dave > > > > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> > > Thanks! > > -- > Peter Xu >
On Thu, Sep 22, 2022 at 05:41:30PM +0100, Dr. David Alan Gilbert wrote: > * Peter Xu (peterx@redhat.com) wrote: > > On Thu, Sep 22, 2022 at 03:49:38PM +0100, Dr. David Alan Gilbert wrote: > > > * Peter Xu (peterx@redhat.com) wrote: > > > > When starting ram saving procedure (especially at the completion phase), > > > > always set last_seen_block to non-NULL to make sure we can always correctly > > > > detect the case where "we've migrated all the dirty pages". > > > > > > > > Then we'll guarantee both last_seen_block and pss.block will be valid > > > > always before the loop starts. > > > > > > > > See the comment in the code for some details. > > > > > > > > Signed-off-by: Peter Xu <peterx@redhat.com> > > > > > > Yeh I guess it can currently only happen during restart? > > > > There're only two places to clear last_seen_block: > > > > ram_state_reset[2683] rs->last_seen_block = NULL; > > ram_postcopy_send_discard_bitmap[2876] rs->last_seen_block = NULL; > > > > Where for the reset case: > > > > ram_state_init[2994] ram_state_reset(*rsp); > > ram_state_resume_prepare[3110] ram_state_reset(rs); > > ram_save_iterate[3271] ram_state_reset(rs); > > > > So I think it can at least happen in two places, either (1) postcopy just > > started (assume when postcopy starts accidentally when all dirty pages were > > migrated?), or (2) postcopy recover from failure. > > Oh, (1) is a more general problem then; yeh. > > > In my case I triggered this deadloop when I was debugging the other bug > > fixed by the next patch where it was postcopy recovery (on tls), but only > > once.. So currently I'm still not 100% sure whether this is the same > > problem, but logically it could trigger. > > > > I also remember I used to hit very rare deadloops before too, maybe they're > > the same thing because I did test recovery a lot. > > Note; 'deadlock' not 'deadloop'. (Oops I somehow forgot there's still this series pending..) Here it's not about a lock, or maybe I should add a space ("dead loop")?
* Peter Xu (peterx@redhat.com) wrote: > On Thu, Sep 22, 2022 at 05:41:30PM +0100, Dr. David Alan Gilbert wrote: > > * Peter Xu (peterx@redhat.com) wrote: > > > On Thu, Sep 22, 2022 at 03:49:38PM +0100, Dr. David Alan Gilbert wrote: > > > > * Peter Xu (peterx@redhat.com) wrote: > > > > > When starting ram saving procedure (especially at the completion phase), > > > > > always set last_seen_block to non-NULL to make sure we can always correctly > > > > > detect the case where "we've migrated all the dirty pages". > > > > > > > > > > Then we'll guarantee both last_seen_block and pss.block will be valid > > > > > always before the loop starts. > > > > > > > > > > See the comment in the code for some details. > > > > > > > > > > Signed-off-by: Peter Xu <peterx@redhat.com> > > > > > > > > Yeh I guess it can currently only happen during restart? > > > > > > There're only two places to clear last_seen_block: > > > > > > ram_state_reset[2683] rs->last_seen_block = NULL; > > > ram_postcopy_send_discard_bitmap[2876] rs->last_seen_block = NULL; > > > > > > Where for the reset case: > > > > > > ram_state_init[2994] ram_state_reset(*rsp); > > > ram_state_resume_prepare[3110] ram_state_reset(rs); > > > ram_save_iterate[3271] ram_state_reset(rs); > > > > > > So I think it can at least happen in two places, either (1) postcopy just > > > started (assume when postcopy starts accidentally when all dirty pages were > > > migrated?), or (2) postcopy recover from failure. > > > > Oh, (1) is a more general problem then; yeh. > > > > > In my case I triggered this deadloop when I was debugging the other bug > > > fixed by the next patch where it was postcopy recovery (on tls), but only > > > once.. So currently I'm still not 100% sure whether this is the same > > > problem, but logically it could trigger. > > > > > > I also remember I used to hit very rare deadloops before too, maybe they're > > > the same thing because I did test recovery a lot. > > > > Note; 'deadlock' not 'deadloop'. > > (Oops I somehow forgot there's still this series pending..) > > Here it's not about a lock, or maybe I should add a space ("dead loop")? So the normal phrases I'm used to are: 'deadlock' - two threads waiting for each other 'livelock' - two threads spinning for each other Dave > -- > Peter Xu >
diff --git a/migration/ram.c b/migration/ram.c index dc1de9ddbc..1d42414ecc 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -2546,14 +2546,22 @@ static int ram_find_and_save_block(RAMState *rs) return pages; } + /* + * Always keep last_seen_block/last_page valid during this procedure, + * because find_dirty_block() relies on these values (e.g., we compare + * last_seen_block with pss.block to see whether we searched all the + * ramblocks) to detect the completion of migration. Having NULL value + * of last_seen_block can conditionally cause below loop to run forever. + */ + if (!rs->last_seen_block) { + rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks); + rs->last_page = 0; + } + pss.block = rs->last_seen_block; pss.page = rs->last_page; pss.complete_round = false; - if (!pss.block) { - pss.block = QLIST_FIRST_RCU(&ram_list.blocks); - } - do { again = true; found = get_queued_page(rs, &pss);
When starting ram saving procedure (especially at the completion phase), always set last_seen_block to non-NULL to make sure we can always correctly detect the case where "we've migrated all the dirty pages". Then we'll guarantee both last_seen_block and pss.block will be valid always before the loop starts. See the comment in the code for some details. Signed-off-by: Peter Xu <peterx@redhat.com> --- migration/ram.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-)