
[v3,7/7] introduce simple linear scan rate limiting mechanism

Message ID 20201119125940.20017-8-andrey.gruzdev@virtuozzo.com (mailing list archive)
State New, archived
Series UFFD write-tracking migration/snapshots

Commit Message

Andrey Gruzdev Nov. 19, 2020, 12:59 p.m. UTC
Since reading UFFD events and saving page data are performed
from the same thread, write fault latencies are sensitive to
migration stream stalls. Limiting the total page saving rate is a
way to reduce the number of noticeable fault resolution latencies.

Migration bandwidth limiting is achieved by noticing cases of
out-of-threshold write fault latencies and temporarily disabling
(strictly speaking, severely throttling) the saving of non-faulting pages.

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 migration/ram.c | 58 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 54 insertions(+), 4 deletions(-)
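
In outline, the rate limiting works as follows; this is a condensed, annotated
restatement of limit_scan_rate() from the diff at the bottom of this page, not
additional code:

/* Called from ram_find_and_save_block() after a page has been saved,
 * and only when write-tracking (background snapshot) mode is enabled. */
static void limit_scan_rate(RAMState *rs)
{
    int64_t latency_ns = 0;

    /* Time since the last UFFD write fault was read: a rough measure of
     * how long that fault took to resolve, including saving its page. */
    if (rs->last_fault_ns) {
        latency_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - rs->last_fault_ns;
        rs->last_fault_ns = 0;
    }
    /* A slow fault (over FAULT_HIGH_LATENCY_NS, i.e. 5 ms) arms the throttle */
    if (latency_ns > FAULT_HIGH_LATENCY_NS) {
        rs->throttle_skip_counter = SLOW_FAULT_SKIP_PAGES;
    }
    /* While armed, each scan iteration waits on the UFFD descriptor for up
     * to SLOW_FAULT_POLL_TMO ms instead of immediately saving yet another
     * non-faulting page, so faulting pages get the stream bandwidth. */
    if (rs->throttle_skip_counter > 0) {
        uffd_poll_events(rs->uffdio_fd, SLOW_FAULT_POLL_TMO);
        rs->throttle_skip_counter--;
    }
}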

Comments

Peter Xu Nov. 19, 2020, 8:02 p.m. UTC | #1
On Thu, Nov 19, 2020 at 03:59:40PM +0300, Andrey Gruzdev wrote:
> Since reading UFFD events and saving paged data are performed
> from the same thread, write fault latencies are sensitive to
> migration stream stalls. Limiting total page saving rate is a
> method to reduce amount of noticiable fault resolution latencies.
> 
> Migration bandwidth limiting is achieved via noticing cases of
> out-of-threshold write fault latencies and temporarily disabling
> (strictly speaking, severely throttling) saving non-faulting pages.

Just curious: have you measured the average/max latency of wr-protected page
requests, or better, even its distribution?

I believe it should also be relevant to where the snapshot is stored, say, the
backend disk of your tests.  Is that a file on some fs?

I would expect the latency to still be good if, e.g., the throughput of the
backend file system is decent, even without a patch like this, but I might have
missed something...

In all cases, it would be very nice if this patch came with the histogram, or
the average/max latency, measured and compared before and after the patch is applied.

Thanks,
Andrey Gruzdev Nov. 20, 2020, 12:06 p.m. UTC | #2
On 19.11.2020 23:02, Peter Xu wrote:
> On Thu, Nov 19, 2020 at 03:59:40PM +0300, Andrey Gruzdev wrote:
>> Since reading UFFD events and saving paged data are performed
>> from the same thread, write fault latencies are sensitive to
>> migration stream stalls. Limiting total page saving rate is a
>> method to reduce amount of noticiable fault resolution latencies.
>>
>> Migration bandwidth limiting is achieved via noticing cases of
>> out-of-threshold write fault latencies and temporarily disabling
>> (strictly speaking, severely throttling) saving non-faulting pages.
> 
> Just curious: have you measured aver/max latency of wr-protected page requests,
> or better, even its distribution?
> 
> I believe it should also be relevant to where the snapshot is stored, say, the
> backend disk of your tests.  Is that a file on some fs?
> 
> I would expect the latency should be still good if e.g. the throughput of the
> backend file system is decent even without a patch like this, but I might have
> missed something..
> 
> In all cases, it would be very nice if this patch can have the histogram or
> aver or max latency measured and compared before/after this patch applied.
> 
> Thanks,
> 
So far I have no objective latency measurements, but that's really
interesting. For testing I commonly used a SATA HDD, not a very fresh one,
a 1.5TB Seagate 7200.11 series drive, chosen specifically not to have a large
hardware cache or flash buffer. And yes, the backend is a file on ext4.

I tried mostly with 'migrate exec:streamdump_utility', a very simple
tool which writes the stream to a file. It doesn't even use AIO, so reads
from STDIN and file writes don't overlap. It was made that way intentionally,
to reduce throughput and produce some random high-latency writes.
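
For illustration only, such a no-AIO dump tool could look roughly like the
sketch below. The real streamdump utility is not shown in this thread, so the
code here is an assumption about its shape, not the actual tool:

/* Hypothetical sketch: copy the migration stream from stdin to a file
 * with plain blocking I/O (no AIO), so reads and writes never overlap. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    static char buf[1 << 20];   /* 1 MiB copy buffer */
    ssize_t n;
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <output-file>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    while ((n = read(STDIN_FILENO, buf, sizeof(buf))) > 0) {
        /* Blocking write; the next read starts only after it completes */
        if (write(fd, buf, n) != n) {
            perror("write");
            return 1;
        }
    }
    close(fd);
    return 0;
}

Every write here blocks the reader of the migration stream, which is exactly
the kind of stall the rate limiting in this patch is meant to absorb.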

I think snapshotting performance may be severely degraded by I/O
happening in parallel on the same storage media; that's the problem we
need to consider.

And yes, taking a latency histogram before/after the patch is a nice idea!
I also need to make a stream dumping tool with AIO, to test with full
storage throughput.
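
Such a before/after comparison could be as simple as bucketing the fault
resolution time, i.e. the delta between rs->last_fault_ns being set when a
fault is read and the faulting page having been saved. A rough sketch of the
bucketing (not part of the patch; everything except the QEMU clock call is a
made-up name for illustration):

#include "qemu/osdep.h"
#include "qemu/timer.h"

/* Hypothetical histogram with power-of-two microsecond buckets */
static uint64_t fault_latency_hist[20];

static void record_fault_latency(int64_t fault_read_ns)
{
    int64_t delta_us = (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) -
                        fault_read_ns) / 1000;
    int bucket = 0;

    /* Bucket N roughly covers [2^N, 2^(N+1)) microseconds */
    while (delta_us > 1 && bucket < 19) {
        delta_us >>= 1;
        bucket++;
    }
    fault_latency_hist[bucket]++;
}

Calling something like this right after a faulting page has been saved and
un-write-protected, and dumping the array at the end of the snapshot, would
give the distribution asked for above.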
Peter Xu Nov. 20, 2020, 3:23 p.m. UTC | #3
On Fri, Nov 20, 2020 at 03:06:56PM +0300, Andrey Gruzdev wrote:
> On 19.11.2020 23:02, Peter Xu wrote:
> > On Thu, Nov 19, 2020 at 03:59:40PM +0300, Andrey Gruzdev wrote:
> > > Since reading UFFD events and saving paged data are performed
> > > from the same thread, write fault latencies are sensitive to
> > > migration stream stalls. Limiting total page saving rate is a
> > > method to reduce amount of noticiable fault resolution latencies.
> > > 
> > > Migration bandwidth limiting is achieved via noticing cases of
> > > out-of-threshold write fault latencies and temporarily disabling
> > > (strictly speaking, severely throttling) saving non-faulting pages.
> > 
> > Just curious: have you measured aver/max latency of wr-protected page requests,
> > or better, even its distribution?
> > 
> > I believe it should also be relevant to where the snapshot is stored, say, the
> > backend disk of your tests.  Is that a file on some fs?
> > 
> > I would expect the latency should be still good if e.g. the throughput of the
> > backend file system is decent even without a patch like this, but I might have
> > missed something..
> > 
> > In all cases, it would be very nice if this patch can have the histogram or
> > aver or max latency measured and compared before/after this patch applied.
> > 
> > Thanks,
> > 
> So far I have no objective latency measurements, that really interesting.
> For testing I commonly used SATA HDD, not too fresh one, 1.5TB Seagate
> 7200.11 series, specially not to have large hardware cache or flash buffer.
> And yes, backend is a file on ext4.
> 
> I tried mostly with 'migrate exec:streamdump_utility', a very simple tool
> which writes stream to a file. It even doesn't use AIO - so reads from STDIN
> and file writes don't overlap. Made so intentionally to reduce throughput
> and give some random high-latency writes.
> 
> I think snapshotting performance may be severely degraded by I/O happening
> in parallel on the same storage media, that's the problem we need to
> consider.

Yeah, this is a good point.  So ideally I think we have a similar requirement here
as with postcopy, in that we really want the postcopy request stream to be separate
from the main migration stream; it's just like what we'd probably prefer here too,
with vcpus writing to wr-protected pages rather than faulting on missing ones.  That
way the urgent requests get handled faster than the background task.

For postcopy, we may still have some chance if someday we can separate the
migration channels, so that we can try to provide a standalone channel for
postcopy requests.  For example, if tcp based, that can be a separate socket.
Then we can try to tune the priority of these sockets to make sure the postcopy
stream is sent earlier.

So I guess it's not always easy to separate these two channels - in your case,
if the disk must be shared, it seems not easy.  Maybe we can prioritize disk I/Os
somehow too, just like the tcp channels?  Not sure.  Hmm, it would be good if
QEMUFile could provide some mechanism to prioritize I/Os as a common interface,
but that's probably too far beyond our discussion. :)
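
For the TCP case, that kind of prioritization could in principle be done with
standard socket options. A hedged sketch (plain Linux socket calls, not
something the current migration code does; SO_PRIORITY is Linux-specific):

/* Hypothetically mark a postcopy/page-request socket as higher priority
 * than the background migration socket, so local queueing disciplines
 * and TOS/DSCP-aware networks can favor it. */
#include <netinet/in.h>
#include <netinet/ip.h>
#include <sys/socket.h>

static int prioritize_socket(int sockfd)
{
    int prio = 6;               /* higher value = preferred by the qdisc */
    int tos = IPTOS_LOWDELAY;   /* legacy low-delay TOS hint */

    if (setsockopt(sockfd, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio)) < 0) {
        return -1;
    }
    if (setsockopt(sockfd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos)) < 0) {
        return -1;
    }
    return 0;
}

Prioritizing the disk I/O side, as wondered above, has no equally simple
portable knob.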

> 
> And yes, to take latency histogram before/after the patch is nice idea!
> Also I need to make stream dumping tool with AIO, to test with full storage
> throughput.

Yeah, I would appreciate it if you could provide those.

Note that I think it can even be done after the functionality gets merged
(Dave/Juan could correct me), since it can be work on top.  Having them
together would be even nicer, of course.  Your call.  It's already much better
than stopping the whole vm, imho.

Thanks,

Patch

diff --git a/migration/ram.c b/migration/ram.c
index 08a1d7a252..89fe106585 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -325,6 +325,10 @@  struct RAMState {
     /* these variables are used for bitmap sync */
     /* last time we did a full bitmap_sync */
     int64_t time_last_bitmap_sync;
+    /* last time a UFFD fault occurred */
+    int64_t last_fault_ns;
+    /* linear scan throttling counter */
+    int throttle_skip_counter;
     /* bytes transferred at start_time */
     uint64_t bytes_xfer_prev;
     /* number of dirty pages since start_time */
@@ -576,9 +580,6 @@  static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
     return 0;
 }
 
-__attribute__ ((unused))
-static bool uffd_poll_events(int uffd, int tmo);
-
 /**
  * uffd_read_events: read pending UFFD events
  *
@@ -2006,9 +2007,51 @@  static bool get_fault_page(RAMState *rs, PageSearchStatus *pss)
         return false;
     }
 
+    rs->last_fault_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
     return true;
 }
 
+#define FAULT_HIGH_LATENCY_NS   5000000     /* 5 ms */
+#define SLOW_FAULT_POLL_TMO     5           /* 5 ms */
+#define SLOW_FAULT_SKIP_PAGES   200
+
+/**
+ * limit_scan_rate: limit RAM linear scan rate in case of growing write fault
+ *  latencies, used in write-tracking migration implementation
+ *
+ * @rs: current RAM state
+ *
+ */
+static void limit_scan_rate(RAMState *rs)
+{
+    int64_t last_fault_latency_ns = 0;
+
+    if (!rs->ram_wt_enabled) {
+        return;
+    }
+
+    /* Check if last write fault time is available */
+    if (rs->last_fault_ns) {
+        last_fault_latency_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) -
+                rs->last_fault_ns;
+        rs->last_fault_ns = 0;
+    }
+
+    /* In case the last fault time was available and we have a
+     * latency value, check whether it is too high */
+    if (last_fault_latency_ns > FAULT_HIGH_LATENCY_NS) {
+        /* Reset counter after each slow write fault */
+        rs->throttle_skip_counter = SLOW_FAULT_SKIP_PAGES;
+    }
+    /* Delay thread execution until the next write fault occurs or the timeout expires.
+     * The next SLOW_FAULT_SKIP_PAGES pages may come from write faults only, not from the
+     * linear scan logic. Thus we moderate the migration stream rate to reduce latencies. */
+    if (rs->throttle_skip_counter > 0) {
+        uffd_poll_events(rs->uffdio_fd, SLOW_FAULT_POLL_TMO);
+        rs->throttle_skip_counter--;
+    }
+}
+
 /**
  * ram_find_and_save_block: finds a dirty page and sends it to f
  *
@@ -2078,6 +2121,9 @@  static int ram_find_and_save_block(RAMState *rs, bool last_stage)
                 if (res < 0) {
                     break;
                 }
+
+                /* Linear scan rate limiting */
+                limit_scan_rate(rs);
             }
         }
     } while (!pages && again);
@@ -2191,12 +2237,15 @@  static void ram_state_reset(RAMState *rs)
     rs->last_sent_block = NULL;
     rs->last_page = 0;
     rs->last_version = ram_list.version;
+    rs->last_fault_ns = 0;
+    rs->throttle_skip_counter = 0;
     rs->ram_wt_enabled = migrate_track_writes_ram();
     rs->ram_bulk_stage = !rs->ram_wt_enabled;
     rs->fpo_enabled = false;
 }
 
 #define MAX_WAIT 50 /* ms, half buffered_file limit */
+#define WT_MAX_WAIT 1000 /* 1000 ms, a bigger limit is needed for 'write-tracking' migration */
 
 /*
  * 'expected' is the value you expect the bitmap mostly to be full
@@ -2872,7 +2921,8 @@  static int ram_save_iterate(QEMUFile *f, void *opaque)
             if ((i & 63) == 0) {
                 uint64_t t1 = (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - t0) /
                               1000000;
-                if (t1 > MAX_WAIT) {
+                uint64_t max_wait = rs->ram_wt_enabled ? WT_MAX_WAIT : MAX_WAIT;
+                if (t1 > max_wait) {
                     trace_ram_save_iterate_big_wait(t1, i);
                     break;
                 }