
[v3,3/7] support UFFD write fault processing in ram_save_iterate()

Message ID 20201119125940.20017-4-andrey.gruzdev@virtuozzo.com (mailing list archive)
State New, archived
Series: UFFD write-tracking migration/snapshots

Commit Message

Andrey Gruzdev Nov. 19, 2020, 12:59 p.m. UTC
In this particular implementation the same single migration
thread is responsible for both normal linear dirty page
migration and processing UFFD page fault events.

Processing write faults includes reading the UFFD file descriptor,
finding the respective RAM block and saving the faulting page to
the migration stream. After the page has been saved, write protection
can be removed. Since the asynchronous version of qemu_put_buffer()
is expected to be used to save pages, we also have to flush the
migration stream prior to un-protecting the saved memory range.

Write protection is removed for any previously protected
memory chunk that has hit the migration stream. That applies
both to pages from the linear page scan and to write-fault pages.

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 migration/ram.c | 124 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 115 insertions(+), 9 deletions(-)
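The ordering the commit message insists on (save the faulting page, flush the still-buffered stream, only then un-protect) can be sketched as a standalone model. `Block`, `Stream` and `process_write_fault` below are reduced stand-ins invented for illustration, not the patch's actual types or API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Reduced stand-in for a RAM block (hypothetical type). */
typedef struct {
    uint8_t *host;          /* start of the block's host mapping */
    size_t used_length;     /* bytes in use */
    bool wp[16];            /* per-page write-protect flag (toy) */
} Block;

/* Stand-in for the migration stream with an async buffer. */
typedef struct {
    uint8_t stream[16 * PAGE_SIZE];
    size_t stream_len;
    size_t pending;         /* bytes buffered but not yet flushed */
} Stream;

/* Append a page asynchronously: data is buffered, not yet on the wire. */
static void put_page_async(Stream *s, const uint8_t *page)
{
    memcpy(s->stream + s->stream_len, page, PAGE_SIZE);
    s->stream_len += PAGE_SIZE;
    s->pending += PAGE_SIZE;
}

static void stream_flush(Stream *s) { s->pending = 0; }

/*
 * Process one write fault at fault_addr: save the faulting page,
 * flush the stream (required because saving is asynchronous), and
 * only then remove write protection — in that order, mirroring the
 * commit message.
 */
static bool process_write_fault(Block *b, Stream *s, uintptr_t fault_addr)
{
    uintptr_t start = (uintptr_t)b->host;

    if (fault_addr < start ||
        fault_addr + PAGE_SIZE > start + b->used_length) {
        return false;       /* not in this block */
    }
    size_t page = (fault_addr - start) / PAGE_SIZE;

    put_page_async(s, b->host + page * PAGE_SIZE);
    stream_flush(s);        /* must complete before un-protecting */
    b->wp[page] = false;    /* the faulting vCPU may now resume */
    return true;
}
```

Un-protecting before the flush would let the vCPU modify the page while its contents may still sit in the output buffer, which is exactly the race the flush prevents.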

Comments

Peter Xu Nov. 19, 2020, 6:25 p.m. UTC | #1
On Thu, Nov 19, 2020 at 03:59:36PM +0300, Andrey Gruzdev via wrote:

[...]

> +/**
> + * ram_find_block_by_host_address: find RAM block containing host page
> + *
> + * Returns true if RAM block is found and pss->block/page are
> + * pointing to the given host page, false in case of an error
> + *
> + * @rs: current RAM state
> + * @pss: page-search-status structure
> + */
> +static bool ram_find_block_by_host_address(RAMState *rs, PageSearchStatus *pss,
> +        hwaddr page_address)
> +{
> +    bool found = false;
> +
> +    pss->block = rs->last_seen_block;
> +    do {
> +        if (page_address >= (hwaddr) pss->block->host &&
> +            (page_address + TARGET_PAGE_SIZE) <=
> +                    ((hwaddr) pss->block->host + pss->block->used_length)) {
> +            pss->page = (unsigned long)
> +                    ((page_address - (hwaddr) pss->block->host) >> TARGET_PAGE_BITS);
> +            found = true;
> +            break;
> +        }
> +
> +        pss->block = QLIST_NEXT_RCU(pss->block, next);
> +        if (!pss->block) {
> +            /* Hit the end of the list */
> +            pss->block = QLIST_FIRST_RCU(&ram_list.blocks);
> +        }
> +    } while (pss->block != rs->last_seen_block);
> +
> +    rs->last_seen_block = pss->block;
> +    /*
> +     * Since we are in the same loop with ram_find_and_save_block(),
> +     * need to reset pss->complete_round after switching to
> +     * other block/page in pss.
> +     */
> +    pss->complete_round = false;
> +
> +    return found;
> +}

I forgot whether Denis and I have discussed this, but I'll try anyways... do
you think we can avoid touching PageSearchStatus at all?

PageSearchStatus is used to track a single migration iteration for precopy, so
that we scan from the 1st ramblock until the last one.  Then we finish one
iteration.

Snapshot is really something, imho, that can easily leverage this structure
without touching it - basically we want to do two things:

  - Do the 1st iteration of precopy (when ram_bulk_stage==true), and do that
    only.  We never need the 2nd, 3rd, ... iterations because we're snapshotting.

  - Leverage the postcopy queue mechanism so that when some page got written,
    queue that page.  We should have this queue higher priority than the
    precopy scanning mentioned above.

As long as we follow above rules, then after the above 1st round precopy, we're
simply done...  If that works, the whole logic of precopy and PageSearchStatus
does not need to be touched, iiuc.

[...]

> @@ -2086,7 +2191,8 @@ static void ram_state_reset(RAMState *rs)
>      rs->last_sent_block = NULL;
>      rs->last_page = 0;
>      rs->last_version = ram_list.version;
> -    rs->ram_bulk_stage = true;
> +    rs->ram_wt_enabled = migrate_track_writes_ram();

Maybe we don't need ram_wt_enabled, but just call migrate_track_writes_ram()
anywhere needed (actually, only in get_fault_page, once).

Thanks,
Andrey Gruzdev Nov. 20, 2020, 10:44 a.m. UTC | #2
On 19.11.2020 21:25, Peter Xu wrote:
> On Thu, Nov 19, 2020 at 03:59:36PM +0300, Andrey Gruzdev via wrote:
> 
> [...]
> 
>> +/**
>> + * ram_find_block_by_host_address: find RAM block containing host page
>> + *
>> + * Returns true if RAM block is found and pss->block/page are
>> + * pointing to the given host page, false in case of an error
>> + *
>> + * @rs: current RAM state
>> + * @pss: page-search-status structure
>> + */
>> +static bool ram_find_block_by_host_address(RAMState *rs, PageSearchStatus *pss,
>> +        hwaddr page_address)
>> +{
>> +    bool found = false;
>> +
>> +    pss->block = rs->last_seen_block;
>> +    do {
>> +        if (page_address >= (hwaddr) pss->block->host &&
>> +            (page_address + TARGET_PAGE_SIZE) <=
>> +                    ((hwaddr) pss->block->host + pss->block->used_length)) {
>> +            pss->page = (unsigned long)
>> +                    ((page_address - (hwaddr) pss->block->host) >> TARGET_PAGE_BITS);
>> +            found = true;
>> +            break;
>> +        }
>> +
>> +        pss->block = QLIST_NEXT_RCU(pss->block, next);
>> +        if (!pss->block) {
>> +            /* Hit the end of the list */
>> +            pss->block = QLIST_FIRST_RCU(&ram_list.blocks);
>> +        }
>> +    } while (pss->block != rs->last_seen_block);
>> +
>> +    rs->last_seen_block = pss->block;
>> +    /*
>> +     * Since we are in the same loop with ram_find_and_save_block(),
>> +     * need to reset pss->complete_round after switching to
>> +     * other block/page in pss.
>> +     */
>> +    pss->complete_round = false;
>> +
>> +    return found;
>> +}
> 
> I forgot whether Denis and I have discussed this, but I'll try anyways... do
> you think we can avoid touching PageSearchStatus at all?
> 
> PageSearchStatus is used to track a single migration iteration for precopy, so
> that we scan from the 1st ramblock until the last one.  Then we finish one
> iteration.
> 

Yes, my first idea also was to separate the normal iteration from the 
write-fault page source completely and leave pss for the normal scan. 
But the other idea is to keep some locality with respect to the last 
write fault: it seems more optimal to restart the normal scan on the 
page next to the faulting one. In this case we can save and un-protect
the neighborhood faster and prevent many write faults.

> Snapshot is really something, imho, that can easily leverage this structure
> without touching it - basically we want to do two things:
> 
>    - Do the 1st iteration of precopy (when ram_bulk_stage==true), and do that
>      only.  We never need the 2nd, 3rd, ... iterations because we're snapshoting.
> 
>    - Leverage the postcopy queue mechanism so that when some page got written,
>      queue that page.  We should have this queue higher priority than the
>      precopy scanning mentioned above.
> 
> As long as we follow above rules, then after the above 1st round precopy, we're
> simply done...  If that works, the whole logic of precopy and PageSearchStatus
> does not need to be touched, iiuc.
> 
> [...]
> 

It's quite a good alternative, and I thought about using the postcopy 
page queue, but that implementation won't consider the locality of writes..

What do you think?

>> @@ -2086,7 +2191,8 @@ static void ram_state_reset(RAMState *rs)
>>       rs->last_sent_block = NULL;
>>       rs->last_page = 0;
>>       rs->last_version = ram_list.version;
>> -    rs->ram_bulk_stage = true;
>> +    rs->ram_wt_enabled = migrate_track_writes_ram();
> 
> Maybe we don't need ram_wt_enabled, but just call migrate_track_writes_ram()
> anywhere needed (actually, only in get_fault_page, once).
> 
> Thanks,
> 

Yes, I think you are right, we can avoid this additional field.

Thanks,
Peter Xu Nov. 20, 2020, 3:07 p.m. UTC | #3
On Fri, Nov 20, 2020 at 01:44:53PM +0300, Andrey Gruzdev wrote:
> On 19.11.2020 21:25, Peter Xu wrote:
> > On Thu, Nov 19, 2020 at 03:59:36PM +0300, Andrey Gruzdev via wrote:
> > 
> > [...]
> > 
> > > +/**
> > > + * ram_find_block_by_host_address: find RAM block containing host page
> > > + *
> > > + * Returns true if RAM block is found and pss->block/page are
> > > + * pointing to the given host page, false in case of an error
> > > + *
> > > + * @rs: current RAM state
> > > + * @pss: page-search-status structure
> > > + */
> > > +static bool ram_find_block_by_host_address(RAMState *rs, PageSearchStatus *pss,
> > > +        hwaddr page_address)
> > > +{
> > > +    bool found = false;
> > > +
> > > +    pss->block = rs->last_seen_block;
> > > +    do {
> > > +        if (page_address >= (hwaddr) pss->block->host &&
> > > +            (page_address + TARGET_PAGE_SIZE) <=
> > > +                    ((hwaddr) pss->block->host + pss->block->used_length)) {
> > > +            pss->page = (unsigned long)
> > > +                    ((page_address - (hwaddr) pss->block->host) >> TARGET_PAGE_BITS);
> > > +            found = true;
> > > +            break;
> > > +        }
> > > +
> > > +        pss->block = QLIST_NEXT_RCU(pss->block, next);
> > > +        if (!pss->block) {
> > > +            /* Hit the end of the list */
> > > +            pss->block = QLIST_FIRST_RCU(&ram_list.blocks);
> > > +        }
> > > +    } while (pss->block != rs->last_seen_block);
> > > +
> > > +    rs->last_seen_block = pss->block;
> > > +    /*
> > > +     * Since we are in the same loop with ram_find_and_save_block(),
> > > +     * need to reset pss->complete_round after switching to
> > > +     * other block/page in pss.
> > > +     */
> > > +    pss->complete_round = false;
> > > +
> > > +    return found;
> > > +}
> > 
> > I forgot whether Denis and I have discussed this, but I'll try anyways... do
> > you think we can avoid touching PageSearchStatus at all?
> > 
> > PageSearchStatus is used to track a single migration iteration for precopy, so
> > that we scan from the 1st ramblock until the last one.  Then we finish one
> > iteration.
> > 
> 
> Yes, my first idea also was to separate normal iteration from write-fault
> page source completely and leave pss for normal scan.. But, the other idea
> is to keep some locality in respect to last write fault. I mean it seems to
> be more optimal to re-start normal scan on the page that is next to faulting
> one. In this case we can save and un-protect
> the neighborhood faster and prevent many write faults.

Yeah, locality sounds reasonable, and you just reminded me that postcopy
already has that, I think. :) Just see get_queued_page():

    if (block) {
        /*
         * As soon as we start servicing pages out of order, then we have
         * to kill the bulk stage, since the bulk stage assumes
         * in (migration_bitmap_find_and_reset_dirty) that every page is
         * dirty, that's no longer true.
         */
        rs->ram_bulk_stage = false;

        /*
         * We want the background search to continue from the queued page
         * since the guest is likely to want other pages near to the page
         * it just requested.
         */
        pss->block = block;
        pss->page = offset >> TARGET_PAGE_BITS;

        /*
         * This unqueued page would break the "one round" check, even is
         * really rare.
         */
        pss->complete_round = false;
    }

So as long as we queue the pages onto the src_page_requests queue, it'll take
care of write locality already, iiuc.

> 
> > Snapshot is really something, imho, that can easily leverage this structure
> > without touching it - basically we want to do two things:
> > 
> >    - Do the 1st iteration of precopy (when ram_bulk_stage==true), and do that
> >      only.  We never need the 2nd, 3rd, ... iterations because we're snapshoting.
> > 
> >    - Leverage the postcopy queue mechanism so that when some page got written,
> >      queue that page.  We should have this queue higher priority than the
> >      precopy scanning mentioned above.
> > 
> > As long as we follow above rules, then after the above 1st round precopy, we're
> > simply done...  If that works, the whole logic of precopy and PageSearchStatus
> > does not need to be touched, iiuc.
> > 
> > [...]
> > 
> 
> It's quite good alternative and I thought about using postcopy page queue,
> but this implementation won't consider the locality of writes..
> 
> What do you think?

So now I think the "do the 1st iteration of precopy only" idea won't work, but
please still consider whether it's natural to just reuse postcopy's queue
mechanism.  IOW, to see whether we can avoid most of the pss logic changes in
this patch.

Thanks,
Andrey Gruzdev Nov. 20, 2020, 4:15 p.m. UTC | #4
On 20.11.2020 18:07, Peter Xu wrote:
> On Fri, Nov 20, 2020 at 01:44:53PM +0300, Andrey Gruzdev wrote:
>> On 19.11.2020 21:25, Peter Xu wrote:
>>> On Thu, Nov 19, 2020 at 03:59:36PM +0300, Andrey Gruzdev via wrote:
>>>
>>> [...]
>>>
>>>> +/**
>>>> + * ram_find_block_by_host_address: find RAM block containing host page
>>>> + *
>>>> + * Returns true if RAM block is found and pss->block/page are
>>>> + * pointing to the given host page, false in case of an error
>>>> + *
>>>> + * @rs: current RAM state
>>>> + * @pss: page-search-status structure
>>>> + */
>>>> +static bool ram_find_block_by_host_address(RAMState *rs, PageSearchStatus *pss,
>>>> +        hwaddr page_address)
>>>> +{
>>>> +    bool found = false;
>>>> +
>>>> +    pss->block = rs->last_seen_block;
>>>> +    do {
>>>> +        if (page_address >= (hwaddr) pss->block->host &&
>>>> +            (page_address + TARGET_PAGE_SIZE) <=
>>>> +                    ((hwaddr) pss->block->host + pss->block->used_length)) {
>>>> +            pss->page = (unsigned long)
>>>> +                    ((page_address - (hwaddr) pss->block->host) >> TARGET_PAGE_BITS);
>>>> +            found = true;
>>>> +            break;
>>>> +        }
>>>> +
>>>> +        pss->block = QLIST_NEXT_RCU(pss->block, next);
>>>> +        if (!pss->block) {
>>>> +            /* Hit the end of the list */
>>>> +            pss->block = QLIST_FIRST_RCU(&ram_list.blocks);
>>>> +        }
>>>> +    } while (pss->block != rs->last_seen_block);
>>>> +
>>>> +    rs->last_seen_block = pss->block;
>>>> +    /*
>>>> +     * Since we are in the same loop with ram_find_and_save_block(),
>>>> +     * need to reset pss->complete_round after switching to
>>>> +     * other block/page in pss.
>>>> +     */
>>>> +    pss->complete_round = false;
>>>> +
>>>> +    return found;
>>>> +}
>>>
>>> I forgot whether Denis and I have discussed this, but I'll try anyways... do
>>> you think we can avoid touching PageSearchStatus at all?
>>>
>>> PageSearchStatus is used to track a single migration iteration for precopy, so
>>> that we scan from the 1st ramblock until the last one.  Then we finish one
>>> iteration.
>>>
>>
>> Yes, my first idea also was to separate normal iteration from write-fault
>> page source completely and leave pss for normal scan.. But, the other idea
>> is to keep some locality in respect to last write fault. I mean it seems to
>> be more optimal to re-start normal scan on the page that is next to faulting
>> one. In this case we can save and un-protect
>> the neighborhood faster and prevent many write faults.
> 
> Yeah locality sounds reasonable, and you just reminded me the fact that
> postcopy has that already I think. :) Just see get_queued_page():
> 
>      if (block) {
>          /*
>           * As soon as we start servicing pages out of order, then we have
>           * to kill the bulk stage, since the bulk stage assumes
>           * in (migration_bitmap_find_and_reset_dirty) that every page is
>           * dirty, that's no longer true.
>           */
>          rs->ram_bulk_stage = false;
> 
>          /*
>           * We want the background search to continue from the queued page
>           * since the guest is likely to want other pages near to the page
>           * it just requested.
>           */
>          pss->block = block;
>          pss->page = offset >> TARGET_PAGE_BITS;
> 
>          /*
>           * This unqueued page would break the "one round" check, even is
>           * really rare.
>           */
>          pss->complete_round = false;
>      }
> 
> So as long as we queue the pages onto the src_page_requests queue, it'll take
> care of write locality already, iiuc.
> 
>>
>>> Snapshot is really something, imho, that can easily leverage this structure
>>> without touching it - basically we want to do two things:
>>>
>>>     - Do the 1st iteration of precopy (when ram_bulk_stage==true), and do that
>>>       only.  We never need the 2nd, 3rd, ... iterations because we're snapshoting.
>>>
>>>     - Leverage the postcopy queue mechanism so that when some page got written,
>>>       queue that page.  We should have this queue higher priority than the
>>>       precopy scanning mentioned above.
>>>
>>> As long as we follow above rules, then after the above 1st round precopy, we're
>>> simply done...  If that works, the whole logic of precopy and PageSearchStatus
>>> does not need to be touched, iiuc.
>>>
>>> [...]
>>>
>>
>> It's quite good alternative and I thought about using postcopy page queue,
>> but this implementation won't consider the locality of writes..
>>
>> What do you think?
> 
> So now I think "Do the 1st iteration of precopy only" idea won't work, but
> still please consider whether it's natural to just reuse postcopy's queue
> mechanism.  IOW, to see whether we can avoid major of the pss logic changes in
> this patch.
> 
> Thanks,
> 

Yeah, I think we can re-use the postcopy queue code for faulting pages. 
I'm a little worried about the additional overhead of dealing with the 
urgent-request semaphore. Also, the code won't change a lot, something like:

[...]
         /* In case of 'write-tracking' migration we first try
          * to poll UFFD and see if we have a write page fault event */
         poll_fault_page(rs);

         again = true;
         found = get_queued_page(rs, &pss);

         if (!found) {
             /* priority queue empty, so just search for something dirty */
             found = find_dirty_block(rs, &pss, &again);
         }
[...]
Peter Xu Nov. 20, 2020, 4:43 p.m. UTC | #5
On Fri, Nov 20, 2020 at 07:15:07PM +0300, Andrey Gruzdev wrote:
> Yeah, I think we can re-use the postcopy queue code for faulting pages. I'm
> worring a little about some additional overhead dealing with urgent request
> semaphore. Also, the code won't change a lot, something like:
> 
> [...]
>         /* In case of 'write-tracking' migration we first try
>          * to poll UFFD and sse if we have write page fault event */
>         poll_fault_page(rs);
> 
>         again = true;
>         found = get_queued_page(rs, &pss);
> 
>         if (!found) {
>             /* priority queue empty, so just search for something dirty */
>             found = find_dirty_block(rs, &pss, &again);
>         }
> [...]

Could I ask what's the "urgent request semaphore"?  Thanks,
Andrey Gruzdev Nov. 20, 2020, 4:53 p.m. UTC | #6
On 20.11.2020 19:43, Peter Xu wrote:
> On Fri, Nov 20, 2020 at 07:15:07PM +0300, Andrey Gruzdev wrote:
>> Yeah, I think we can re-use the postcopy queue code for faulting pages. I'm
>> worring a little about some additional overhead dealing with urgent request
>> semaphore. Also, the code won't change a lot, something like:
>>
>> [...]
>>          /* In case of 'write-tracking' migration we first try
>>           * to poll UFFD and sse if we have write page fault event */
>>          poll_fault_page(rs);
>>
>>          again = true;
>>          found = get_queued_page(rs, &pss);
>>
>>          if (!found) {
>>              /* priority queue empty, so just search for something dirty */
>>              found = find_dirty_block(rs, &pss, &again);
>>          }
>> [...]
> 
> Could I ask what's the "urgent request semaphore"?  Thanks,
> 

These functions use it (the actual name is 'rate_limit_sem'):

void migration_make_urgent_request(void)
{
     qemu_sem_post(&migrate_get_current()->rate_limit_sem);
}

void migration_consume_urgent_request(void)
{
     qemu_sem_wait(&migrate_get_current()->rate_limit_sem);
}

They are called from ram_save_queue_pages and unqueue_page, respectively, 
to control the migration rate limiter.

bool migration_rate_limit(void)
{
[...]
         /*
          * Wait for a delay to do rate limiting OR
          * something urgent to post the semaphore.
          */
         int ms = s->iteration_start_time + BUFFER_DELAY - now;
         trace_migration_rate_limit_pre(ms);
         if (qemu_sem_timedwait(&s->rate_limit_sem, ms) == 0) {
             /*
              * We were woken by one or more urgent things but
              * the timedwait will have consumed one of them.
              * The service routine for the urgent wake will dec
              * the semaphore itself for each item it consumes,
              * so add this one we just eat back.
              */
             qemu_sem_post(&s->rate_limit_sem);
             urgent = true;
         }
[...]
}
Peter Xu Nov. 23, 2020, 9:34 p.m. UTC | #7
On Fri, Nov 20, 2020 at 07:53:34PM +0300, Andrey Gruzdev wrote:
> On 20.11.2020 19:43, Peter Xu wrote:
> > On Fri, Nov 20, 2020 at 07:15:07PM +0300, Andrey Gruzdev wrote:
> > > Yeah, I think we can re-use the postcopy queue code for faulting pages. I'm
> > > worring a little about some additional overhead dealing with urgent request
> > > semaphore. Also, the code won't change a lot, something like:
> > > 
> > > [...]
> > >          /* In case of 'write-tracking' migration we first try
> > >           * to poll UFFD and sse if we have write page fault event */
> > >          poll_fault_page(rs);
> > > 
> > >          again = true;
> > >          found = get_queued_page(rs, &pss);
> > > 
> > >          if (!found) {
> > >              /* priority queue empty, so just search for something dirty */
> > >              found = find_dirty_block(rs, &pss, &again);
> > >          }
> > > [...]
> > 
> > Could I ask what's the "urgent request semaphore"?  Thanks,
> > 
> 
> These function use it (the correct name is 'rate_limit_sem'):
> 
> void migration_make_urgent_request(void)
> {
>     qemu_sem_post(&migrate_get_current()->rate_limit_sem);
> }
> 
> void migration_consume_urgent_request(void)
> {
>     qemu_sem_wait(&migrate_get_current()->rate_limit_sem);
> }
> 
> They are called from ram_save_queue_pages and unqueue_page, accordingly, to
> control migration rate limiter.
> 
> bool migration_rate_limit(void)
> {
> [...]
>         /*
>          * Wait for a delay to do rate limiting OR
>          * something urgent to post the semaphore.
>          */
>         int ms = s->iteration_start_time + BUFFER_DELAY - now;
>         trace_migration_rate_limit_pre(ms);
>         if (qemu_sem_timedwait(&s->rate_limit_sem, ms) == 0) {
>             /*
>              * We were woken by one or more urgent things but
>              * the timedwait will have consumed one of them.
>              * The service routine for the urgent wake will dec
>              * the semaphore itself for each item it consumes,
>              * so add this one we just eat back.
>              */
>             qemu_sem_post(&s->rate_limit_sem);
>             urgent = true;
>         }
> [...]
> }
> 

Hmm... why would its overhead be a problem?  If it's an overhead that can be
avoided, then postcopy might want that too.

The thing is, I really feel like the snapshot logic can leverage the whole idea
of existing postcopy (like get_queued_page/unqueue_page; it's probably due to
the fact that both of them want to "migrate some more urgent pages than the
background migration, due to either missing/wrprotected pages"), but I might
be missing something.

Thanks,
Andrey Gruzdev Nov. 24, 2020, 8:02 a.m. UTC | #8
On 24.11.2020 00:34, Peter Xu wrote:
> On Fri, Nov 20, 2020 at 07:53:34PM +0300, Andrey Gruzdev wrote:
>> On 20.11.2020 19:43, Peter Xu wrote:
>>> On Fri, Nov 20, 2020 at 07:15:07PM +0300, Andrey Gruzdev wrote:
>>>> Yeah, I think we can re-use the postcopy queue code for faulting pages. I'm
>>>> worring a little about some additional overhead dealing with urgent request
>>>> semaphore. Also, the code won't change a lot, something like:
>>>>
>>>> [...]
>>>>           /* In case of 'write-tracking' migration we first try
>>>>            * to poll UFFD and sse if we have write page fault event */
>>>>           poll_fault_page(rs);
>>>>
>>>>           again = true;
>>>>           found = get_queued_page(rs, &pss);
>>>>
>>>>           if (!found) {
>>>>               /* priority queue empty, so just search for something dirty */
>>>>               found = find_dirty_block(rs, &pss, &again);
>>>>           }
>>>> [...]
>>>
>>> Could I ask what's the "urgent request semaphore"?  Thanks,
>>>
>>
>> These function use it (the correct name is 'rate_limit_sem'):
>>
>> void migration_make_urgent_request(void)
>> {
>>      qemu_sem_post(&migrate_get_current()->rate_limit_sem);
>> }
>>
>> void migration_consume_urgent_request(void)
>> {
>>      qemu_sem_wait(&migrate_get_current()->rate_limit_sem);
>> }
>>
>> They are called from ram_save_queue_pages and unqueue_page, accordingly, to
>> control migration rate limiter.
>>
>> bool migration_rate_limit(void)
>> {
>> [...]
>>          /*
>>           * Wait for a delay to do rate limiting OR
>>           * something urgent to post the semaphore.
>>           */
>>          int ms = s->iteration_start_time + BUFFER_DELAY - now;
>>          trace_migration_rate_limit_pre(ms);
>>          if (qemu_sem_timedwait(&s->rate_limit_sem, ms) == 0) {
>>              /*
>>               * We were woken by one or more urgent things but
>>               * the timedwait will have consumed one of them.
>>               * The service routine for the urgent wake will dec
>>               * the semaphore itself for each item it consumes,
>>               * so add this one we just eat back.
>>               */
>>              qemu_sem_post(&s->rate_limit_sem);
>>              urgent = true;
>>          }
>> [...]
>> }
>>
> 
> Hmm... Why its overhead could be a problem?  If it's an overhead that can be
> avoided, then postcopy might want that too.
> 
> The thing is I really feel like the snapshot logic can leverage the whole idea
> of existing postcopy (like get_queued_page/unqueue_page; it's probably due to
> the fact that both of them want to "migrate some more urgent pages than the
> background migration, due to either missing/wrprotected pages"), but I might
> have something missing.
> 
> Thanks,
> 

I don't think this overhead itself is a problem, since it's negligible 
compared to other stuff.. You're undoubtedly correct about using the 
postcopy idea in the case when wr-fault pages arrive from a separate 
thread or an external source. Then we really need to decouple that 
separate thread or external requestor from the migration thread.

In this patch series wr-faults arrive in the same migration loop as the 
normal scan, so I don't see a direct reason to pass them through the 
queue, but I might have missed something from your proposal.

Do you mean that if we use the postcopy logic, we'll allocate memory, 
make copies of the faulting pages and then immediately un-protect them?
Or do we just put the faulting page address into the queued item and 
release protection after the page content has been saved?

Thanks,
Peter Xu Nov. 24, 2020, 3:17 p.m. UTC | #9
On Tue, Nov 24, 2020 at 11:02:09AM +0300, Andrey Gruzdev wrote:
> On 24.11.2020 00:34, Peter Xu wrote:
> > On Fri, Nov 20, 2020 at 07:53:34PM +0300, Andrey Gruzdev wrote:
> > > On 20.11.2020 19:43, Peter Xu wrote:
> > > > On Fri, Nov 20, 2020 at 07:15:07PM +0300, Andrey Gruzdev wrote:
> > > > > Yeah, I think we can re-use the postcopy queue code for faulting pages. I'm
> > > > > worring a little about some additional overhead dealing with urgent request
> > > > > semaphore. Also, the code won't change a lot, something like:
> > > > > 
> > > > > [...]
> > > > >           /* In case of 'write-tracking' migration we first try
> > > > >            * to poll UFFD and sse if we have write page fault event */
> > > > >           poll_fault_page(rs);
> > > > > 
> > > > >           again = true;
> > > > >           found = get_queued_page(rs, &pss);
> > > > > 
> > > > >           if (!found) {
> > > > >               /* priority queue empty, so just search for something dirty */
> > > > >               found = find_dirty_block(rs, &pss, &again);
> > > > >           }
> > > > > [...]
> > > > 
> > > > Could I ask what's the "urgent request semaphore"?  Thanks,
> > > > 
> > > 
> > > These function use it (the correct name is 'rate_limit_sem'):
> > > 
> > > void migration_make_urgent_request(void)
> > > {
> > >      qemu_sem_post(&migrate_get_current()->rate_limit_sem);
> > > }
> > > 
> > > void migration_consume_urgent_request(void)
> > > {
> > >      qemu_sem_wait(&migrate_get_current()->rate_limit_sem);
> > > }
> > > 
> > > They are called from ram_save_queue_pages and unqueue_page, accordingly, to
> > > control migration rate limiter.
> > > 
> > > bool migration_rate_limit(void)
> > > {
> > > [...]
> > >          /*
> > >           * Wait for a delay to do rate limiting OR
> > >           * something urgent to post the semaphore.
> > >           */
> > >          int ms = s->iteration_start_time + BUFFER_DELAY - now;
> > >          trace_migration_rate_limit_pre(ms);
> > >          if (qemu_sem_timedwait(&s->rate_limit_sem, ms) == 0) {
> > >              /*
> > >               * We were woken by one or more urgent things but
> > >               * the timedwait will have consumed one of them.
> > >               * The service routine for the urgent wake will dec
> > >               * the semaphore itself for each item it consumes,
> > >               * so add this one we just eat back.
> > >               */
> > >              qemu_sem_post(&s->rate_limit_sem);
> > >              urgent = true;
> > >          }
> > > [...]
> > > }
> > > 
> > 
> > Hmm... Why its overhead could be a problem?  If it's an overhead that can be
> > avoided, then postcopy might want that too.
> > 
> > The thing is I really feel like the snapshot logic can leverage the whole idea
> > of existing postcopy (like get_queued_page/unqueue_page; it's probably due to
> > the fact that both of them want to "migrate some more urgent pages than the
> > background migration, due to either missing/wrprotected pages"), but I might
> > have something missing.
> > 
> > Thanks,
> > 
> 
> I don't think this overhead itself is a problem since its negligible
> compared to other stuff.. You're undoubtly correct about using postcopy idea
> in case when wr-fault pages arrive from the separate thread or external
> source. Then we really need to decouple that separate thread or external
> requestor from the migration thread.
> 
> In this patch series wr-faults arrive in the same migration loop with normal
> scan, so I don't see direct reason to pass it through the queue, but I might
> have missed something from your proposal.

I see your point.  For me whether using a thread is not extremely important -
actually we can have a standalone thread to handle the vcpu faults too just
like postcopy; it's just run on the src host.  My major point is that we should
avoid introducing the extra pss logic because fundamentally it's doing the same
thing as get_queued_page() right now, unless I'm wrong...

So I think your previous poll_fault_page() call is okay; assuming the same poll
model as you used, I mean something like this:

------8<-------
diff --git a/migration/ram.c b/migration/ram.c
index 7811cde643..718dd276c7 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1473,6 +1473,14 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
 
     } while (block && !dirty);
 
+    if (!block) {
+        /*
+         * Set the block too if live snapshot is enabled; that's when we have
+         * vcpus blocked by the wr-protected pages.
+         */
+        block = poll_fault_page(rs, &offset);
+    }
+
     if (block) {
         /*
          * As soon as we start servicing pages out of order, then we have
------8<-------

So as long as we set the block and offset, pss will be updated just like
postcopy.  That's exactly what we want, am I right?

> 
> Do you mean that if we use postcopy logic, we'll allocate memory and make
> copies of faulting pages and then immediately un-protect them?
> Or do we just put the faulting page address into the queued item and release
> protection after the page content has been saved?

I think the current approach would be fine: we don't copy the page, and we
unprotect it at a single place after it is dumped to the precopy stream.  If
you could measure the latencies it would be even better; then we'll know what
to expect.

IIRC the "copy page first" idea came from Denis's initial version, which I
thought was too aggressive since potentially it can eat twice the memory on the
host for a single guest, not to mention when live snapshots are taken for
multiple guests on the same host.  It's just not something that can be directly
expected when the user wants to take a snapshot.  That's something we can do on
top of this work, imho, if we see very high latency on resolving the page
faults.  But that's still the last thing to try (or at least a "by default off"
option), so we can still think about some other solution before trying to eat
too much host memory.

Thanks,
Andrey Gruzdev Nov. 24, 2020, 5:40 p.m. UTC | #10
On 24.11.2020 18:17, Peter Xu wrote:
> On Tue, Nov 24, 2020 at 11:02:09AM +0300, Andrey Gruzdev wrote:
>> On 24.11.2020 00:34, Peter Xu wrote:
>>> On Fri, Nov 20, 2020 at 07:53:34PM +0300, Andrey Gruzdev wrote:
>>>> On 20.11.2020 19:43, Peter Xu wrote:
>>>>> On Fri, Nov 20, 2020 at 07:15:07PM +0300, Andrey Gruzdev wrote:
>>>>>> Yeah, I think we can re-use the postcopy queue code for faulting pages. I'm
>>>>>> worrying a little about some additional overhead of dealing with the urgent
>>>>>> request semaphore. Also, the code won't change a lot, something like:
>>>>>>
>>>>>> [...]
>>>>>>            /* In case of 'write-tracking' migration we first try
>>>>>>             * to poll UFFD and see if we have a write page fault event */
>>>>>>            poll_fault_page(rs);
>>>>>>
>>>>>>            again = true;
>>>>>>            found = get_queued_page(rs, &pss);
>>>>>>
>>>>>>            if (!found) {
>>>>>>                /* priority queue empty, so just search for something dirty */
>>>>>>                found = find_dirty_block(rs, &pss, &again);
>>>>>>            }
>>>>>> [...]
>>>>>
>>>>> Could I ask what's the "urgent request semaphore"?  Thanks,
>>>>>
>>>>
>>>> These function use it (the correct name is 'rate_limit_sem'):
>>>>
>>>> void migration_make_urgent_request(void)
>>>> {
>>>>       qemu_sem_post(&migrate_get_current()->rate_limit_sem);
>>>> }
>>>>
>>>> void migration_consume_urgent_request(void)
>>>> {
>>>>       qemu_sem_wait(&migrate_get_current()->rate_limit_sem);
>>>> }
>>>>
>>>> They are called from ram_save_queue_pages and unqueue_page, accordingly, to
>>>> control migration rate limiter.
>>>>
>>>> bool migration_rate_limit(void)
>>>> {
>>>> [...]
>>>>           /*
>>>>            * Wait for a delay to do rate limiting OR
>>>>            * something urgent to post the semaphore.
>>>>            */
>>>>           int ms = s->iteration_start_time + BUFFER_DELAY - now;
>>>>           trace_migration_rate_limit_pre(ms);
>>>>           if (qemu_sem_timedwait(&s->rate_limit_sem, ms) == 0) {
>>>>               /*
>>>>                * We were woken by one or more urgent things but
>>>>                * the timedwait will have consumed one of them.
>>>>                * The service routine for the urgent wake will dec
>>>>                * the semaphore itself for each item it consumes,
>>>>                * so add this one we just eat back.
>>>>                */
>>>>               qemu_sem_post(&s->rate_limit_sem);
>>>>               urgent = true;
>>>>           }
>>>> [...]
>>>> }
>>>>
>>>
>>> Hmm... Why would its overhead be a problem?  If it's an overhead that can be
>>> avoided, then postcopy might want that too.
>>>
>>> The thing is I really feel like the snapshot logic can leverage the whole idea
>>> of existing postcopy (like get_queued_page/unqueue_page; it's probably due to
>>> the fact that both of them want to "migrate some more urgent pages than the
>>> background migration, due to either missing/wrprotected pages"), but I might
>>> have something missing.
>>>
>>> Thanks,
>>>
>>
>> I don't think this overhead itself is a problem since it's negligible
>> compared to other stuff.. You're undoubtedly correct about using the postcopy
>> idea in the case when wr-fault pages arrive from a separate thread or external
>> source. Then we really need to decouple that separate thread or external
>> requestor from the migration thread.
>>
>> In this patch series wr-faults arrive in the same migration loop with the
>> normal scan, so I don't see a direct reason to pass them through the queue,
>> but I might have missed something from your proposal.
> 
> I see your point.  For me, whether to use a thread is not extremely important -
> actually we can have a standalone thread to handle the vcpu faults too just
> like postcopy; it's just run on the src host.  My major point is that we should
> avoid introducing the extra pss logic because fundamentally it's doing the same
> thing as get_queued_page() right now, unless I'm wrong...
> 
> So I think your previous poll_fault_page() call is okay; assuming the same poll
> model as you used, I mean something like this:
> 
> ------8<-------
> diff --git a/migration/ram.c b/migration/ram.c
> index 7811cde643..718dd276c7 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1473,6 +1473,14 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
>   
>       } while (block && !dirty);
>   
> +    if (!block) {
> +        /*
> +         * Set the block too if live snapshot is enabled; that's when we have
> +         * vcpus blocked by the wr-protected pages.
> +         */
> +        block = poll_fault_page(rs, &offset);
> +    }
> +
>       if (block) {
>           /*
>            * As soon as we start servicing pages out of order, then we have
> ------8<-------
> 
> So as long as we set the block and offset, pss will be updated just like
> postcopy.  That's exactly what we want, am I right?
> 

I also think this is the best place to put the polling call, so as not to
touch pss yet again.

>>
>> Do you mean that if we use postcopy logic, we'll allocate memory and make
>> copies of faulting pages and then immediately un-protect them?
>> Or do we just put the faulting page address into the queued item and release
>> protection after the page content has been saved?
> 
> I think the current approach would be fine: we don't copy the page, and we
> unprotect it at a single place after it is dumped to the precopy stream.  If
> you could measure the latencies it would be even better; then we'll know what
> to expect.
> 
> IIRC the "copy page first" idea came from Denis's initial version, which I
> thought was too aggressive since potentially it can eat twice the memory on the
> host for a single guest, not to mention when live snapshots are taken for
> multiple guests on the same host.  It's just not something that can be directly
> expected when the user wants to take a snapshot.  That's something we can do on
> top of this work, imho, if we see very high latency on resolving the page
> faults.  But that's still the last thing to try (or at least a "by default off"
> option), so we can still think about some other solution before trying to eat
> too much host memory.
> 
> Thanks,
> 

I totally agree that creating shadow copies for each faulting page is
not a very robust idea, because we could never know the 'proper' limit
for the buffer memory we should allocate and cut from the host; it
depends on hardware, guest workload, host I/O intensity, too many things.

So I believe the file/block backend side is a better place to provide
some buffering and manage the memory/latency tradeoff. It also allows
us to be less 'invasive' to the QEMU code itself.

Thanks,
Dr. David Alan Gilbert Nov. 25, 2020, 1:08 p.m. UTC | #11
* Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> In this particular implementation the same single migration
> thread is responsible for both normal linear dirty page
> migration and processing UFFD page fault events.
> 
> Processing write faults includes reading UFFD file descriptor,
> finding respective RAM block and saving faulting page to
> the migration stream. After page has been saved, write protection
> can be removed. Since asynchronous version of qemu_put_buffer()
> is expected to be used to save pages, we also have to flush
> migration stream prior to un-protecting saved memory range.
> 
> Write protection is being removed for any previously protected
> memory chunk that has hit the migration stream. That's valid
> for pages from linear page scan along with write fault pages.
> 
> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> ---
>  migration/ram.c | 124 ++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 115 insertions(+), 9 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 7f273c9996..08a1d7a252 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -314,6 +314,8 @@ struct RAMState {
>      ram_addr_t last_page;
>      /* last ram version we have seen */
>      uint32_t last_version;
> +    /* 'write-tracking' migration is enabled */
> +    bool ram_wt_enabled;
>      /* We are in the first round */
>      bool ram_bulk_stage;
>      /* The free page optimization is enabled */
> @@ -574,8 +576,6 @@ static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
>      return 0;
>  }
>  
> -__attribute__ ((unused))
> -static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
>  __attribute__ ((unused))
>  static bool uffd_poll_events(int uffd, int tmo);
>  
> @@ -1929,6 +1929,86 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
>      return pages;
>  }
>  
> +/**
> + * ram_find_block_by_host_address: find RAM block containing host page
> + *
> + * Returns true if RAM block is found and pss->block/page are
> + * pointing to the given host page, false in case of an error
> + *
> + * @rs: current RAM state
> + * @pss: page-search-status structure
> + */
> +static bool ram_find_block_by_host_address(RAMState *rs, PageSearchStatus *pss,
> +        hwaddr page_address)
> +{
> +    bool found = false;
> +
> +    pss->block = rs->last_seen_block;
> +    do {
> +        if (page_address >= (hwaddr) pss->block->host &&
> +            (page_address + TARGET_PAGE_SIZE) <=
> +                    ((hwaddr) pss->block->host + pss->block->used_length)) {
> +            pss->page = (unsigned long)
> +                    ((page_address - (hwaddr) pss->block->host) >> TARGET_PAGE_BITS);
> +            found = true;
> +            break;
> +        }
> +
> +        pss->block = QLIST_NEXT_RCU(pss->block, next);
> +        if (!pss->block) {
> +            /* Hit the end of the list */
> +            pss->block = QLIST_FIRST_RCU(&ram_list.blocks);
> +        }
> +    } while (pss->block != rs->last_seen_block);
> +
> +    rs->last_seen_block = pss->block;
> +    /*
> +     * Since we are in the same loop with ram_find_and_save_block(),
> +     * need to reset pss->complete_round after switching to
> +     * other block/page in pss.
> +     */
> +    pss->complete_round = false;
> +
> +    return found;
> +}
> +
> +/**
> + * get_fault_page: try to get next UFFD write fault page and, if pending fault
> + *   is found, put info about RAM block and block page into pss structure
> + *
> + * Returns true in case of UFFD write fault detected, false otherwise
> + *
> + * @rs: current RAM state
> + * @pss: page-search-status structure
> + *
> + */
> +static bool get_fault_page(RAMState *rs, PageSearchStatus *pss)
> +{
> +    struct uffd_msg uffd_msg;
> +    hwaddr page_address;
> +    int res;
> +
> +    if (!rs->ram_wt_enabled) {
> +        return false;
> +    }
> +
> +    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
> +    if (res <= 0) {
> +        return false;
> +    }
> +
> +    page_address = uffd_msg.arg.pagefault.address;
> +    if (!ram_find_block_by_host_address(rs, pss, page_address)) {
> +        /* In case we couldn't find respective block, just unprotect faulting page */
> +        uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false);
> +        error_report("ram_find_block_by_host_address() failed: address=0x%0lx",
> +                     page_address);
> +        return false;
> +    }
> +
> +    return true;
> +}
> +
>  /**
>   * ram_find_and_save_block: finds a dirty page and sends it to f
>   *
> @@ -1955,25 +2035,50 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>          return pages;
>      }
>  
> +    if (!rs->last_seen_block) {
> +        rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
> +    }
>      pss.block = rs->last_seen_block;
>      pss.page = rs->last_page;
>      pss.complete_round = false;
>  
> -    if (!pss.block) {
> -        pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
> -    }
> -
>      do {
> +        ram_addr_t page;
> +        ram_addr_t page_to;
> +
>          again = true;
> -        found = get_queued_page(rs, &pss);
> -
> +        /* In case of 'write-tracking' migration we first try
> +         * to poll UFFD and get write page fault event */
> +        found = get_fault_page(rs, &pss);
> +        if (!found) {
> +            /* Now try to fetch something from the priority queue */
> +            found = get_queued_page(rs, &pss);
> +        }
>          if (!found) {
>              /* priority queue empty, so just search for something dirty */
>              found = find_dirty_block(rs, &pss, &again);
>          }
>  
>          if (found) {
> +            page = pss.page;
>              pages = ram_save_host_page(rs, &pss, last_stage);
> +            page_to = pss.page;
> +
> +            /* Check if page is from UFFD-managed region */
> +            if (pss.block->flags & RAM_UF_WRITEPROTECT) {
> +                hwaddr page_address = (hwaddr) pss.block->host +
> +                        ((hwaddr) page << TARGET_PAGE_BITS);
> +                hwaddr run_length = (hwaddr) (page_to - page + 1) << TARGET_PAGE_BITS;
> +                int res;
> +
> +                /* Flush async buffers before un-protect */
> +                qemu_fflush(rs->f);
> +                /* Un-protect memory range */
> +                res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false);
> +                if (res < 0) {
> +                    break;
> +                }
> +            }

Please separate that out into a separate function.

Dave

>          }
>      } while (!pages && again);
>  
> @@ -2086,7 +2191,8 @@ static void ram_state_reset(RAMState *rs)
>      rs->last_sent_block = NULL;
>      rs->last_page = 0;
>      rs->last_version = ram_list.version;
> -    rs->ram_bulk_stage = true;
> +    rs->ram_wt_enabled = migrate_track_writes_ram();
> +    rs->ram_bulk_stage = !rs->ram_wt_enabled;
>      rs->fpo_enabled = false;
>  }
>  
> -- 
> 2.25.1
>
Andrey Gruzdev Nov. 25, 2020, 2:40 p.m. UTC | #12
On 25.11.2020 16:08, Dr. David Alan Gilbert wrote:
> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>> In this particular implementation the same single migration
>> thread is responsible for both normal linear dirty page
>> migration and processing UFFD page fault events.
>>
>> Processing write faults includes reading UFFD file descriptor,
>> finding respective RAM block and saving faulting page to
>> the migration stream. After page has been saved, write protection
>> can be removed. Since asynchronous version of qemu_put_buffer()
>> is expected to be used to save pages, we also have to flush
>> migration stream prior to un-protecting saved memory range.
>>
>> Write protection is being removed for any previously protected
>> memory chunk that has hit the migration stream. That's valid
>> for pages from linear page scan along with write fault pages.
>>
>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>> ---
>>   migration/ram.c | 124 ++++++++++++++++++++++++++++++++++++++++++++----
>>   1 file changed, 115 insertions(+), 9 deletions(-)
>>
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 7f273c9996..08a1d7a252 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -314,6 +314,8 @@ struct RAMState {
>>       ram_addr_t last_page;
>>       /* last ram version we have seen */
>>       uint32_t last_version;
>> +    /* 'write-tracking' migration is enabled */
>> +    bool ram_wt_enabled;
>>       /* We are in the first round */
>>       bool ram_bulk_stage;
>>       /* The free page optimization is enabled */
>> @@ -574,8 +576,6 @@ static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
>>       return 0;
>>   }
>>   
>> -__attribute__ ((unused))
>> -static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
>>   __attribute__ ((unused))
>>   static bool uffd_poll_events(int uffd, int tmo);
>>   
>> @@ -1929,6 +1929,86 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
>>       return pages;
>>   }
>>   
>> +/**
>> + * ram_find_block_by_host_address: find RAM block containing host page
>> + *
>> + * Returns true if RAM block is found and pss->block/page are
>> + * pointing to the given host page, false in case of an error
>> + *
>> + * @rs: current RAM state
>> + * @pss: page-search-status structure
>> + */
>> +static bool ram_find_block_by_host_address(RAMState *rs, PageSearchStatus *pss,
>> +        hwaddr page_address)
>> +{
>> +    bool found = false;
>> +
>> +    pss->block = rs->last_seen_block;
>> +    do {
>> +        if (page_address >= (hwaddr) pss->block->host &&
>> +            (page_address + TARGET_PAGE_SIZE) <=
>> +                    ((hwaddr) pss->block->host + pss->block->used_length)) {
>> +            pss->page = (unsigned long)
>> +                    ((page_address - (hwaddr) pss->block->host) >> TARGET_PAGE_BITS);
>> +            found = true;
>> +            break;
>> +        }
>> +
>> +        pss->block = QLIST_NEXT_RCU(pss->block, next);
>> +        if (!pss->block) {
>> +            /* Hit the end of the list */
>> +            pss->block = QLIST_FIRST_RCU(&ram_list.blocks);
>> +        }
>> +    } while (pss->block != rs->last_seen_block);
>> +
>> +    rs->last_seen_block = pss->block;
>> +    /*
>> +     * Since we are in the same loop with ram_find_and_save_block(),
>> +     * need to reset pss->complete_round after switching to
>> +     * other block/page in pss.
>> +     */
>> +    pss->complete_round = false;
>> +
>> +    return found;
>> +}
>> +
>> +/**
>> + * get_fault_page: try to get next UFFD write fault page and, if pending fault
>> + *   is found, put info about RAM block and block page into pss structure
>> + *
>> + * Returns true in case of UFFD write fault detected, false otherwise
>> + *
>> + * @rs: current RAM state
>> + * @pss: page-search-status structure
>> + *
>> + */
>> +static bool get_fault_page(RAMState *rs, PageSearchStatus *pss)
>> +{
>> +    struct uffd_msg uffd_msg;
>> +    hwaddr page_address;
>> +    int res;
>> +
>> +    if (!rs->ram_wt_enabled) {
>> +        return false;
>> +    }
>> +
>> +    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
>> +    if (res <= 0) {
>> +        return false;
>> +    }
>> +
>> +    page_address = uffd_msg.arg.pagefault.address;
>> +    if (!ram_find_block_by_host_address(rs, pss, page_address)) {
>> +        /* In case we couldn't find respective block, just unprotect faulting page */
>> +        uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false);
>> +        error_report("ram_find_block_by_host_address() failed: address=0x%0lx",
>> +                     page_address);
>> +        return false;
>> +    }
>> +
>> +    return true;
>> +}
>> +
>>   /**
>>    * ram_find_and_save_block: finds a dirty page and sends it to f
>>    *
>> @@ -1955,25 +2035,50 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>>           return pages;
>>       }
>>   
>> +    if (!rs->last_seen_block) {
>> +        rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
>> +    }
>>       pss.block = rs->last_seen_block;
>>       pss.page = rs->last_page;
>>       pss.complete_round = false;
>>   
>> -    if (!pss.block) {
>> -        pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
>> -    }
>> -
>>       do {
>> +        ram_addr_t page;
>> +        ram_addr_t page_to;
>> +
>>           again = true;
>> -        found = get_queued_page(rs, &pss);
>> -
>> +        /* In case of 'write-tracking' migration we first try
>> +         * to poll UFFD and get write page fault event */
>> +        found = get_fault_page(rs, &pss);
>> +        if (!found) {
>> +            /* Now try to fetch something from the priority queue */
>> +            found = get_queued_page(rs, &pss);
>> +        }
>>           if (!found) {
>>               /* priority queue empty, so just search for something dirty */
>>               found = find_dirty_block(rs, &pss, &again);
>>           }
>>   
>>           if (found) {
>> +            page = pss.page;
>>               pages = ram_save_host_page(rs, &pss, last_stage);
>> +            page_to = pss.page;
>> +
>> +            /* Check if page is from UFFD-managed region */
>> +            if (pss.block->flags & RAM_UF_WRITEPROTECT) {
>> +                hwaddr page_address = (hwaddr) pss.block->host +
>> +                        ((hwaddr) page << TARGET_PAGE_BITS);
>> +                hwaddr run_length = (hwaddr) (page_to - page + 1) << TARGET_PAGE_BITS;
>> +                int res;
>> +
>> +                /* Flush async buffers before un-protect */
>> +                qemu_fflush(rs->f);
>> +                /* Un-protect memory range */
>> +                res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false);
>> +                if (res < 0) {
>> +                    break;
>> +                }
>> +            }
> 
> Please separate that out into a separate function.
> 
> Dave
> 

You mean a separate implementation of ram_find_and_save_block()?

Andrey

>>           }
>>       } while (!pages && again);
>>   
>> @@ -2086,7 +2191,8 @@ static void ram_state_reset(RAMState *rs)
>>       rs->last_sent_block = NULL;
>>       rs->last_page = 0;
>>       rs->last_version = ram_list.version;
>> -    rs->ram_bulk_stage = true;
>> +    rs->ram_wt_enabled = migrate_track_writes_ram();
>> +    rs->ram_bulk_stage = !rs->ram_wt_enabled;
>>       rs->fpo_enabled = false;
>>   }
>>   
>> -- 
>> 2.25.1
>>
Dr. David Alan Gilbert Nov. 25, 2020, 6:41 p.m. UTC | #13
* Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> On 25.11.2020 16:08, Dr. David Alan Gilbert wrote:
> > * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> > > In this particular implementation the same single migration
> > > thread is responsible for both normal linear dirty page
> > > migration and processing UFFD page fault events.
> > > 
> > > Processing write faults includes reading UFFD file descriptor,
> > > finding respective RAM block and saving faulting page to
> > > the migration stream. After page has been saved, write protection
> > > can be removed. Since asynchronous version of qemu_put_buffer()
> > > is expected to be used to save pages, we also have to flush
> > > migration stream prior to un-protecting saved memory range.
> > > 
> > > Write protection is being removed for any previously protected
> > > memory chunk that has hit the migration stream. That's valid
> > > for pages from linear page scan along with write fault pages.
> > > 
> > > Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> > > ---
> > >   migration/ram.c | 124 ++++++++++++++++++++++++++++++++++++++++++++----
> > >   1 file changed, 115 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/migration/ram.c b/migration/ram.c
> > > index 7f273c9996..08a1d7a252 100644
> > > --- a/migration/ram.c
> > > +++ b/migration/ram.c
> > > @@ -314,6 +314,8 @@ struct RAMState {
> > >       ram_addr_t last_page;
> > >       /* last ram version we have seen */
> > >       uint32_t last_version;
> > > +    /* 'write-tracking' migration is enabled */
> > > +    bool ram_wt_enabled;
> > >       /* We are in the first round */
> > >       bool ram_bulk_stage;
> > >       /* The free page optimization is enabled */
> > > @@ -574,8 +576,6 @@ static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
> > >       return 0;
> > >   }
> > > -__attribute__ ((unused))
> > > -static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
> > >   __attribute__ ((unused))
> > >   static bool uffd_poll_events(int uffd, int tmo);
> > > @@ -1929,6 +1929,86 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
> > >       return pages;
> > >   }
> > > +/**
> > > + * ram_find_block_by_host_address: find RAM block containing host page
> > > + *
> > > + * Returns true if RAM block is found and pss->block/page are
> > > + * pointing to the given host page, false in case of an error
> > > + *
> > > + * @rs: current RAM state
> > > + * @pss: page-search-status structure
> > > + */
> > > +static bool ram_find_block_by_host_address(RAMState *rs, PageSearchStatus *pss,
> > > +        hwaddr page_address)
> > > +{
> > > +    bool found = false;
> > > +
> > > +    pss->block = rs->last_seen_block;
> > > +    do {
> > > +        if (page_address >= (hwaddr) pss->block->host &&
> > > +            (page_address + TARGET_PAGE_SIZE) <=
> > > +                    ((hwaddr) pss->block->host + pss->block->used_length)) {
> > > +            pss->page = (unsigned long)
> > > +                    ((page_address - (hwaddr) pss->block->host) >> TARGET_PAGE_BITS);
> > > +            found = true;
> > > +            break;
> > > +        }
> > > +
> > > +        pss->block = QLIST_NEXT_RCU(pss->block, next);
> > > +        if (!pss->block) {
> > > +            /* Hit the end of the list */
> > > +            pss->block = QLIST_FIRST_RCU(&ram_list.blocks);
> > > +        }
> > > +    } while (pss->block != rs->last_seen_block);
> > > +
> > > +    rs->last_seen_block = pss->block;
> > > +    /*
> > > +     * Since we are in the same loop with ram_find_and_save_block(),
> > > +     * need to reset pss->complete_round after switching to
> > > +     * other block/page in pss.
> > > +     */
> > > +    pss->complete_round = false;
> > > +
> > > +    return found;
> > > +}
> > > +
> > > +/**
> > > + * get_fault_page: try to get next UFFD write fault page and, if pending fault
> > > + *   is found, put info about RAM block and block page into pss structure
> > > + *
> > > + * Returns true in case of UFFD write fault detected, false otherwise
> > > + *
> > > + * @rs: current RAM state
> > > + * @pss: page-search-status structure
> > > + *
> > > + */
> > > +static bool get_fault_page(RAMState *rs, PageSearchStatus *pss)
> > > +{
> > > +    struct uffd_msg uffd_msg;
> > > +    hwaddr page_address;
> > > +    int res;
> > > +
> > > +    if (!rs->ram_wt_enabled) {
> > > +        return false;
> > > +    }
> > > +
> > > +    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
> > > +    if (res <= 0) {
> > > +        return false;
> > > +    }
> > > +
> > > +    page_address = uffd_msg.arg.pagefault.address;
> > > +    if (!ram_find_block_by_host_address(rs, pss, page_address)) {
> > > +        /* In case we couldn't find respective block, just unprotect faulting page */
> > > +        uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false);
> > > +        error_report("ram_find_block_by_host_address() failed: address=0x%0lx",
> > > +                     page_address);
> > > +        return false;
> > > +    }
> > > +
> > > +    return true;
> > > +}
> > > +
> > >   /**
> > >    * ram_find_and_save_block: finds a dirty page and sends it to f
> > >    *
> > > @@ -1955,25 +2035,50 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
> > >           return pages;
> > >       }
> > > +    if (!rs->last_seen_block) {
> > > +        rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
> > > +    }
> > >       pss.block = rs->last_seen_block;
> > >       pss.page = rs->last_page;
> > >       pss.complete_round = false;
> > > -    if (!pss.block) {
> > > -        pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
> > > -    }
> > > -
> > >       do {
> > > +        ram_addr_t page;
> > > +        ram_addr_t page_to;
> > > +
> > >           again = true;
> > > -        found = get_queued_page(rs, &pss);
> > > -
> > > +        /* In case of 'write-tracking' migration we first try
> > > +         * to poll UFFD and get write page fault event */
> > > +        found = get_fault_page(rs, &pss);
> > > +        if (!found) {
> > > +            /* Now try to fetch something from the priority queue */
> > > +            found = get_queued_page(rs, &pss);
> > > +        }
> > >           if (!found) {
> > >               /* priority queue empty, so just search for something dirty */
> > >               found = find_dirty_block(rs, &pss, &again);
> > >           }
> > >           if (found) {
> > > +            page = pss.page;
> > >               pages = ram_save_host_page(rs, &pss, last_stage);
> > > +            page_to = pss.page;
> > > +
> > > +            /* Check if page is from UFFD-managed region */
> > > +            if (pss.block->flags & RAM_UF_WRITEPROTECT) {
> > > +                hwaddr page_address = (hwaddr) pss.block->host +
> > > +                        ((hwaddr) page << TARGET_PAGE_BITS);
> > > +                hwaddr run_length = (hwaddr) (page_to - page + 1) << TARGET_PAGE_BITS;
> > > +                int res;
> > > +
> > > +                /* Flush async buffers before un-protect */
> > > +                qemu_fflush(rs->f);
> > > +                /* Un-protect memory range */
> > > +                res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false);
> > > +                if (res < 0) {
> > > +                    break;
> > > +                }
> > > +            }
> > 
> > Please separate that out into a separate function.
> > 
> > Dave
> > 
> 
> You mean a separate implementation of ram_find_and_save_block()?

No, I mean take that if (...WRITEPROTECT) { ..} block and put it in a
function and call it from there, just to keep ram_find_and_save_block
clean.

Dave

> Andrey
> 
> > >           }
> > >       } while (!pages && again);
> > > @@ -2086,7 +2191,8 @@ static void ram_state_reset(RAMState *rs)
> > >       rs->last_sent_block = NULL;
> > >       rs->last_page = 0;
> > >       rs->last_version = ram_list.version;
> > > -    rs->ram_bulk_stage = true;
> > > +    rs->ram_wt_enabled = migrate_track_writes_ram();
> > > +    rs->ram_bulk_stage = !rs->ram_wt_enabled;
> > >       rs->fpo_enabled = false;
> > >   }
> > > -- 
> > > 2.25.1
> > > 
> 
> 
> -- 
> Andrey Gruzdev, Principal Engineer
> Virtuozzo GmbH  +7-903-247-6397
>                 virtuzzo.com
>
Andrey Gruzdev Nov. 25, 2020, 7:12 p.m. UTC | #14
On 25.11.2020 21:41, Dr. David Alan Gilbert wrote:
> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>> On 25.11.2020 16:08, Dr. David Alan Gilbert wrote:
>>> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>>>> In this particular implementation the same single migration
>>>> thread is responsible for both normal linear dirty page
>>>> migration and processing UFFD page fault events.
>>>>
>>>> Processing write faults includes reading the UFFD file descriptor,
>>>> finding the respective RAM block and saving the faulting page to
>>>> the migration stream. After the page has been saved, write protection
>>>> can be removed. Since an asynchronous version of qemu_put_buffer()
>>>> is expected to be used to save pages, we also have to flush the
>>>> migration stream prior to un-protecting the saved memory range.
>>>>
>>>> Write protection is being removed for any previously protected
>>>> memory chunk that has hit the migration stream. That's valid
>>>> for pages from linear page scan along with write fault pages.
>>>>
>>>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>>>> ---
>>>>    migration/ram.c | 124 ++++++++++++++++++++++++++++++++++++++++++++----
>>>>    1 file changed, 115 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/migration/ram.c b/migration/ram.c
>>>> index 7f273c9996..08a1d7a252 100644
>>>> --- a/migration/ram.c
>>>> +++ b/migration/ram.c
>>>> @@ -314,6 +314,8 @@ struct RAMState {
>>>>        ram_addr_t last_page;
>>>>        /* last ram version we have seen */
>>>>        uint32_t last_version;
>>>> +    /* 'write-tracking' migration is enabled */
>>>> +    bool ram_wt_enabled;
>>>>        /* We are in the first round */
>>>>        bool ram_bulk_stage;
>>>>        /* The free page optimization is enabled */
>>>> @@ -574,8 +576,6 @@ static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
>>>>        return 0;
>>>>    }
>>>> -__attribute__ ((unused))
>>>> -static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
>>>>    __attribute__ ((unused))
>>>>    static bool uffd_poll_events(int uffd, int tmo);
>>>> @@ -1929,6 +1929,86 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
>>>>        return pages;
>>>>    }
>>>> +/**
>>>> + * ram_find_block_by_host_address: find RAM block containing host page
>>>> + *
>>>> + * Returns true if RAM block is found and pss->block/page are
>>>> + * pointing to the given host page, false in case of an error
>>>> + *
>>>> + * @rs: current RAM state
>>>> + * @pss: page-search-status structure
>>>> + */
>>>> +static bool ram_find_block_by_host_address(RAMState *rs, PageSearchStatus *pss,
>>>> +        hwaddr page_address)
>>>> +{
>>>> +    bool found = false;
>>>> +
>>>> +    pss->block = rs->last_seen_block;
>>>> +    do {
>>>> +        if (page_address >= (hwaddr) pss->block->host &&
>>>> +            (page_address + TARGET_PAGE_SIZE) <=
>>>> +                    ((hwaddr) pss->block->host + pss->block->used_length)) {
>>>> +            pss->page = (unsigned long)
>>>> +                    ((page_address - (hwaddr) pss->block->host) >> TARGET_PAGE_BITS);
>>>> +            found = true;
>>>> +            break;
>>>> +        }
>>>> +
>>>> +        pss->block = QLIST_NEXT_RCU(pss->block, next);
>>>> +        if (!pss->block) {
>>>> +            /* Hit the end of the list */
>>>> +            pss->block = QLIST_FIRST_RCU(&ram_list.blocks);
>>>> +        }
>>>> +    } while (pss->block != rs->last_seen_block);
>>>> +
>>>> +    rs->last_seen_block = pss->block;
>>>> +    /*
>>>> +     * Since we are in the same loop with ram_find_and_save_block(),
>>>> +     * need to reset pss->complete_round after switching to
>>>> +     * other block/page in pss.
>>>> +     */
>>>> +    pss->complete_round = false;
>>>> +
>>>> +    return found;
>>>> +}
>>>> +
>>>> +/**
>>>> + * get_fault_page: try to get next UFFD write fault page and, if pending fault
>>>> + *   is found, put info about RAM block and block page into pss structure
>>>> + *
>>>> + * Returns true in case of UFFD write fault detected, false otherwise
>>>> + *
>>>> + * @rs: current RAM state
>>>> + * @pss: page-search-status structure
>>>> + *
>>>> + */
>>>> +static bool get_fault_page(RAMState *rs, PageSearchStatus *pss)
>>>> +{
>>>> +    struct uffd_msg uffd_msg;
>>>> +    hwaddr page_address;
>>>> +    int res;
>>>> +
>>>> +    if (!rs->ram_wt_enabled) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
>>>> +    if (res <= 0) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    page_address = uffd_msg.arg.pagefault.address;
>>>> +    if (!ram_find_block_by_host_address(rs, pss, page_address)) {
>>>> +        /* In case we couldn't find the respective block, just unprotect the faulting page */
>>>> +        uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false);
>>>> +        error_report("ram_find_block_by_host_address() failed: address=0x%0lx",
>>>> +                     page_address);
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    return true;
>>>> +}
>>>> +
>>>>    /**
>>>>     * ram_find_and_save_block: finds a dirty page and sends it to f
>>>>     *
>>>> @@ -1955,25 +2035,50 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>>>>            return pages;
>>>>        }
>>>> +    if (!rs->last_seen_block) {
>>>> +        rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
>>>> +    }
>>>>        pss.block = rs->last_seen_block;
>>>>        pss.page = rs->last_page;
>>>>        pss.complete_round = false;
>>>> -    if (!pss.block) {
>>>> -        pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
>>>> -    }
>>>> -
>>>>        do {
>>>> +        ram_addr_t page;
>>>> +        ram_addr_t page_to;
>>>> +
>>>>            again = true;
>>>> -        found = get_queued_page(rs, &pss);
>>>> -
>>>> +        /* In case of 'write-tracking' migration we first try
>>>> +         * to poll UFFD and get a write page fault event */
>>>> +        found = get_fault_page(rs, &pss);
>>>> +        if (!found) {
>>>> +            /* Now try to fetch something from the priority queue */
>>>> +            found = get_queued_page(rs, &pss);
>>>> +        }
>>>>            if (!found) {
>>>>                /* priority queue empty, so just search for something dirty */
>>>>                found = find_dirty_block(rs, &pss, &again);
>>>>            }
>>>>            if (found) {
>>>> +            page = pss.page;
>>>>                pages = ram_save_host_page(rs, &pss, last_stage);
>>>> +            page_to = pss.page;
>>>> +
>>>> +            /* Check if page is from UFFD-managed region */
>>>> +            if (pss.block->flags & RAM_UF_WRITEPROTECT) {
>>>> +                hwaddr page_address = (hwaddr) pss.block->host +
>>>> +                        ((hwaddr) page << TARGET_PAGE_BITS);
>>>> +                hwaddr run_length = (hwaddr) (page_to - page + 1) << TARGET_PAGE_BITS;
>>>> +                int res;
>>>> +
>>>> +                /* Flush async buffers before un-protect */
>>>> +                qemu_fflush(rs->f);
>>>> +                /* Un-protect memory range */
>>>> +                res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false);
>>>> +                if (res < 0) {
>>>> +                    break;
>>>> +                }
>>>> +            }
>>>
>>> Please separate that out into a separate function.
>>>
>>> Dave
>>>
>>
>> You mean a separate implementation of ram_find_and_save_block()?
> 
> No, I mean take that if (...WRITEPROTECT) { ..} block and put it in a
> function and call it from there, just to keep ram_find_and_save_block
> clean.
> 
> Dave
> 

Got it, ok.

Andrey

>> Andrey
>>
>>>>            }
>>>>        } while (!pages && again);
>>>> @@ -2086,7 +2191,8 @@ static void ram_state_reset(RAMState *rs)
>>>>        rs->last_sent_block = NULL;
>>>>        rs->last_page = 0;
>>>>        rs->last_version = ram_list.version;
>>>> -    rs->ram_bulk_stage = true;
>>>> +    rs->ram_wt_enabled = migrate_track_writes_ram();
>>>> +    rs->ram_bulk_stage = !rs->ram_wt_enabled;
>>>>        rs->fpo_enabled = false;
>>>>    }
>>>> -- 
>>>> 2.25.1
>>>>
>>
>>
>> -- 
>> Andrey Gruzdev, Principal Engineer
>> Virtuozzo GmbH  +7-903-247-6397
>>                  virtuzzo.com
>>
diff mbox series

Patch

diff --git a/migration/ram.c b/migration/ram.c
index 7f273c9996..08a1d7a252 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -314,6 +314,8 @@  struct RAMState {
     ram_addr_t last_page;
     /* last ram version we have seen */
     uint32_t last_version;
+    /* 'write-tracking' migration is enabled */
+    bool ram_wt_enabled;
     /* We are in the first round */
     bool ram_bulk_stage;
     /* The free page optimization is enabled */
@@ -574,8 +576,6 @@  static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
     return 0;
 }
 
-__attribute__ ((unused))
-static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
 __attribute__ ((unused))
 static bool uffd_poll_events(int uffd, int tmo);
 
@@ -1929,6 +1929,86 @@  static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
     return pages;
 }
 
+/**
+ * ram_find_block_by_host_address: find RAM block containing host page
+ *
+ * Returns true if RAM block is found and pss->block/page are
+ * pointing to the given host page, false in case of an error
+ *
+ * @rs: current RAM state
+ * @pss: page-search-status structure
+ */
+static bool ram_find_block_by_host_address(RAMState *rs, PageSearchStatus *pss,
+        hwaddr page_address)
+{
+    bool found = false;
+
+    pss->block = rs->last_seen_block;
+    do {
+        if (page_address >= (hwaddr) pss->block->host &&
+            (page_address + TARGET_PAGE_SIZE) <=
+                    ((hwaddr) pss->block->host + pss->block->used_length)) {
+            pss->page = (unsigned long)
+                    ((page_address - (hwaddr) pss->block->host) >> TARGET_PAGE_BITS);
+            found = true;
+            break;
+        }
+
+        pss->block = QLIST_NEXT_RCU(pss->block, next);
+        if (!pss->block) {
+            /* Hit the end of the list */
+            pss->block = QLIST_FIRST_RCU(&ram_list.blocks);
+        }
+    } while (pss->block != rs->last_seen_block);
+
+    rs->last_seen_block = pss->block;
+    /*
+     * Since we are in the same loop with ram_find_and_save_block(),
+     * need to reset pss->complete_round after switching to
+     * other block/page in pss.
+     */
+    pss->complete_round = false;
+
+    return found;
+}
+
+/**
+ * get_fault_page: try to get next UFFD write fault page and, if pending fault
+ *   is found, put info about RAM block and block page into pss structure
+ *
+ * Returns true in case of UFFD write fault detected, false otherwise
+ *
+ * @rs: current RAM state
+ * @pss: page-search-status structure
+ *
+ */
+static bool get_fault_page(RAMState *rs, PageSearchStatus *pss)
+{
+    struct uffd_msg uffd_msg;
+    hwaddr page_address;
+    int res;
+
+    if (!rs->ram_wt_enabled) {
+        return false;
+    }
+
+    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
+    if (res <= 0) {
+        return false;
+    }
+
+    page_address = uffd_msg.arg.pagefault.address;
+    if (!ram_find_block_by_host_address(rs, pss, page_address)) {
+        /* In case we couldn't find the respective block, just unprotect the faulting page */
+        uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false);
+        error_report("ram_find_block_by_host_address() failed: address=0x%0lx",
+                     page_address);
+        return false;
+    }
+
+    return true;
+}
+
 /**
  * ram_find_and_save_block: finds a dirty page and sends it to f
  *
@@ -1955,25 +2035,50 @@  static int ram_find_and_save_block(RAMState *rs, bool last_stage)
         return pages;
     }
 
+    if (!rs->last_seen_block) {
+        rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
+    }
     pss.block = rs->last_seen_block;
     pss.page = rs->last_page;
     pss.complete_round = false;
 
-    if (!pss.block) {
-        pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
-    }
-
     do {
+        ram_addr_t page;
+        ram_addr_t page_to;
+
         again = true;
-        found = get_queued_page(rs, &pss);
-
+        /* In case of 'write-tracking' migration we first try
+         * to poll UFFD and get a write page fault event */
+        found = get_fault_page(rs, &pss);
+        if (!found) {
+            /* Now try to fetch something from the priority queue */
+            found = get_queued_page(rs, &pss);
+        }
         if (!found) {
             /* priority queue empty, so just search for something dirty */
             found = find_dirty_block(rs, &pss, &again);
         }
 
         if (found) {
+            page = pss.page;
             pages = ram_save_host_page(rs, &pss, last_stage);
+            page_to = pss.page;
+
+            /* Check if page is from UFFD-managed region */
+            if (pss.block->flags & RAM_UF_WRITEPROTECT) {
+                hwaddr page_address = (hwaddr) pss.block->host +
+                        ((hwaddr) page << TARGET_PAGE_BITS);
+                hwaddr run_length = (hwaddr) (page_to - page + 1) << TARGET_PAGE_BITS;
+                int res;
+
+                /* Flush async buffers before un-protect */
+                qemu_fflush(rs->f);
+                /* Un-protect memory range */
+                res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false);
+                if (res < 0) {
+                    break;
+                }
+            }
         }
     } while (!pages && again);
 
@@ -2086,7 +2191,8 @@  static void ram_state_reset(RAMState *rs)
     rs->last_sent_block = NULL;
     rs->last_page = 0;
     rs->last_version = ram_list.version;
-    rs->ram_bulk_stage = true;
+    rs->ram_wt_enabled = migrate_track_writes_ram();
+    rs->ram_bulk_stage = !rs->ram_wt_enabled;
     rs->fpo_enabled = false;
 }